Analyzing support for transgender rights across the EU

Survey Research Methodology II Assignment

Loading libraries

rm(list = ls())
library(MASS)
library(performance)
library(tidyverse)
library(haven)
library(readxl)
library(xml2)
library(rvest)
library(janitor)
library(DataExplorer)
library(countrycode)
library(explore)
library(lme4)
library(lmerTest)
library(broom.mixed)
library(ggplot2)
library(e1071)
library(dplyr)
library(scales)
library(tidyr)
library(kableExtra)
library(gmodels)
library(flexplot)
library(corrplot)
library(DescTools)
library(caret)
library(ROSE)
library(randomForest)
library(pROC)
library(mice)
library(glmnet)
library(fastDummies)
library(rempsyc)
library(gbm)

Loading data

Survey data

data_raw <- read_dta("data/ZA7575.dta")

Country-level datasets

This dataframe will be used later to join the various country-level datasets together.

# building a df with country names and codes for the EU-28
codelist <- countrycode::codelist |> 
  select(country.name.en, iso.name.en, un.name.en, cow.name, ecb, eurostat, iso2c, iso3c, eu28) |> 
  filter(!is.na(eu28)) 

GDP per capita

This is GDP per capita, PPP (constant 2021 international $). Retrieved from the World Bank (https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.KD) for the year 2019.

gdp_data <- read_csv("data/gdp_pc_ppp_2021.csv", skip = 4, show_col_types = FALSE)

head(gdp_data)
## # A tibble: 6 × 69
##   `Country Name`  `Country Code` `Indicator Name` `Indicator Code` `1960` `1961`
##   <chr>           <chr>          <chr>            <chr>            <lgl>  <lgl> 
## 1 Aruba           ABW            GDP per capita,… NY.GDP.PCAP.PP.… NA     NA    
## 2 Africa Eastern… AFE            GDP per capita,… NY.GDP.PCAP.PP.… NA     NA    
## 3 Afghanistan     AFG            GDP per capita,… NY.GDP.PCAP.PP.… NA     NA    
## 4 Africa Western… AFW            GDP per capita,… NY.GDP.PCAP.PP.… NA     NA    
## 5 Angola          AGO            GDP per capita,… NY.GDP.PCAP.PP.… NA     NA    
## 6 Albania         ALB            GDP per capita,… NY.GDP.PCAP.PP.… NA     NA    
## # ℹ 63 more variables: `1962` <lgl>, `1963` <lgl>, `1964` <lgl>, `1965` <lgl>,
## #   `1966` <lgl>, `1967` <lgl>, `1968` <lgl>, `1969` <lgl>, `1970` <lgl>,
## #   `1971` <lgl>, `1972` <lgl>, `1973` <lgl>, `1974` <lgl>, `1975` <lgl>,
## #   `1976` <lgl>, `1977` <lgl>, `1978` <lgl>, `1979` <lgl>, `1980` <lgl>,
## #   `1981` <lgl>, `1982` <lgl>, `1983` <lgl>, `1984` <lgl>, `1985` <lgl>,
## #   `1986` <lgl>, `1987` <lgl>, `1988` <lgl>, `1989` <lgl>, `1990` <dbl>,
## #   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
gdp_data <- gdp_data |> 
  select(`Country Name`, `Country Code`, `2019`)


gdp_data <- gdp_data |> 
  rename(country_name = `Country Name`,
         country_code = `Country Code`,
         gdp_pc_ppp = `2019`)

gdp_data <- gdp_data |> 
  inner_join(select(codelist, iso3c), by = c("country_code" = "iso3c"))

str(gdp_data)
## tibble [28 × 3] (S3: tbl_df/tbl/data.frame)
##  $ country_name: chr [1:28] "Austria" "Belgium" "Bulgaria" "Cyprus" ...
##  $ country_code: chr [1:28] "AUT" "BEL" "BGR" "CYP" ...
##  $ gdp_pc_ppp  : num [1:28] 64630 60452 27673 46157 47720 ...

We now have 28 observations, including the UK.
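As a quick sanity check, we can confirm that no EU-28 member was dropped by the inner join; a hedged sketch using only the objects defined above:

```r
# Any EU-28 country missing from gdp_data after the join?
codelist |>
  anti_join(gdp_data, by = c("iso3c" = "country_code")) |>
  select(country.name.en)
# An empty tibble means every EU-28 member matched
```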

LGBT rights

We found this LGBT rights index on Our World in Data (https://ourworldindata.org/grapher/lgbt-rights-index). It captures whether LGBT+ people enjoy the same rights as cisgender people, combining information on 18 different policies, including the legal status of same-sex marriage.

lgbt_rights_index <- read.csv("data/lgbt-rights-index.csv")

lgbt_rights_index <- lgbt_rights_index |> 
  filter(Year == 2019) |> 
  inner_join(select(codelist, iso3c), by = c("Code" = "iso3c")) |> 
  select(-Year) # all observations are from 2019

str(lgbt_rights_index)
## 'data.frame':    28 obs. of  3 variables:
##  $ Entity            : chr  "Austria" "Belgium" "Bulgaria" "Croatia" ...
##  $ Code              : chr  "AUT" "BEL" "BGR" "HRV" ...
##  $ LGBT..Policy.Index: num  8.92 10 4.94 6.96 3.95 ...

Column names can be renamed:

lgbt_rights_index <- lgbt_rights_index |> 
  rename(country_name = Entity,
         country_code = Code,
         lgbt_policy_index = LGBT..Policy.Index) 

Gender inequality index

Gender Development and Gender Inequality indexes, developed by the United Nations Development Programme. Retrieved from https://hdr.undp.org/data-center/documentation-and-downloads for the year 2019.

gender_index <- read_xlsx("data/UNDP_gender_indexes.xlsx")

gender_index <- gender_index |>
  select(-dimension, -note, -year) |> # empty/useless columns
  inner_join(select(codelist, iso3c), by = c("countryIsoCode" = "iso3c"))

gender_index |> 
  distinct(country) |> 
  nrow()
## [1] 28
str(gender_index)
## tibble [560 × 7] (S3: tbl_df/tbl/data.frame)
##  $ countryIsoCode: chr [1:560] "AUT" "AUT" "AUT" "AUT" ...
##  $ country       : chr [1:560] "Austria" "Austria" "Austria" "Austria" ...
##  $ indexCode     : chr [1:560] "GII" "GDI" "GDI" "GDI" ...
##  $ index         : chr [1:560] "Gender Inequality Index" "Gender Development Index" "Gender Development Index" "Gender Development Index" ...
##  $ indicatorCode : chr [1:560] "abr" "eys_f" "eys_m" "gdi" ...
##  $ indicator     : chr [1:560] "Adolescent Birth Rate (births per 1,000 women ages 15-19)" "Expected Years of Schooling, female (years)" "Expected Years of Schooling, male (years)" "Gender Development Index (value)" ...
##  $ value         : num [1:560] 5.499 16.451 15.725 0.969 0.053 ...

This confirms that we indeed have 28 distinct countries in the dataset.

This dataset collects many indicators in addition to the index values. We select only the Gender Development and Gender Inequality indexes (GDI and GII):

gender_index <- gender_index |> 
  filter(indicatorCode %in% c("gdi", "gii")) |> 
  select(-c(indexCode, indicatorCode, indicator)) |> # removing redundant columns
  pivot_wider(names_from = index, values_from = value)

colnames(gender_index) <- janitor::make_clean_names(colnames(gender_index)) # cleaning spaces and upper cases

And then exploring them to see which one is a better fit for modeling and predictions:

summary(gender_index$gender_inequality_index)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0160  0.0555  0.0890  0.1067  0.1335  0.2390
summary(gender_index$gender_development_index)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9560  0.9725  0.9855  0.9856  0.9918  1.0230
sd(gender_index$gender_inequality_index)
## [1] 0.06664423
sd(gender_index$gender_development_index)
## [1] 0.0167627

We observe that the second one (the Gender Development Index) has a very small range, indicating that most countries have nearly identical scores. The standard deviations confirm this: the GII is much more spread out than the GDI. Being almost constant, the GDI won’t add much value to the analysis. This is likely because these indexes are built by the United Nations Development Programme for all countries worldwide, so they may not capture the finer differences between EU countries.

cor(gender_index$gender_inequality_index, 
    gender_index$gender_development_index, 
    use = "complete.obs")
## [1] 0.1750589

Surprisingly, the two indexes have only a very weak positive correlation. Since we interpreted them as measuring opposite things, we had expected a negative correlation. The correlation coefficient may simply not be informative here given the low variation of both variables (especially the GDI), so we ultimately chose to keep the GII and discard the GDI.
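To see this visually, a quick scatterplot sketch (run before the GDI is dropped below) can show the weak association between the two indexes:

```r
# Scatterplot of GII against GDI across the EU-28
ggplot(gender_index,
       aes(x = gender_development_index, y = gender_inequality_index)) +
  geom_point() +
  labs(x = "Gender Development Index (GDI)",
       y = "Gender Inequality Index (GII)")
```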

gender_index <- gender_index |> 
  select(-gender_development_index)

Economist’s Democracy Index

democracy_index <- read_xlsx("data/EIU_democracy_index.xlsx", sheet = 4)

# the ISO codes were lowercase, which prevents the join
democracy_index$geo <- toupper(democracy_index$geo)

# filter for 2019 and EU28 countries
democracy_index <- democracy_index |> 
  filter(time == 2019) |> 
  inner_join((select(codelist, iso3c)), by = c("geo" = "iso3c"))

# clean var names
names(democracy_index) <- names(democracy_index) %>%
  janitor::make_clean_names() %>%
  gsub("_eiu$", "", .)

democracy_index <- democracy_index |> 
  rename(country_code = geo,
         country_name = name,
         year = time)

str(democracy_index)
## tibble [28 × 10] (S3: tbl_df/tbl/data.frame)
##  $ country_code                 : chr [1:28] "AUT" "BEL" "BGR" "HRV" ...
##  $ country_name                 : chr [1:28] "Austria" "Belgium" "Bulgaria" "Croatia" ...
##  $ year                         : num [1:28] 2019 2019 2019 2019 2019 ...
##  $ democracy_index              : num [1:28] 82.9 76.4 70.3 65.7 75.9 76.9 92.2 79 92.5 81.2 ...
##  $ electoral_pluralism_index    : num [1:28] 95.8 95.8 91.7 91.7 91.7 95.8 100 95.8 100 95.8 ...
##  $ government_index             : num [1:28] 78.6 82.1 64.3 60.7 64.3 67.9 92.9 78.6 89.3 78.6 ...
##  $ political_participation_index: num [1:28] 83.3 50 72.2 55.6 66.7 66.7 83.3 66.7 88.9 77.8 ...
##  $ political_culture_index      : num [1:28] 68.8 68.8 43.8 50 68.8 68.8 93.8 68.8 87.5 68.8 ...
##  $ civil_liberties_index        : num [1:28] 88.2 85.3 79.4 70.6 88.2 85.3 91.2 85.3 97.1 85.3 ...
##  $ change_in_democracy_index    : num [1:28] 0 -1.4 0 0 0 ...

In this case, we are only keeping the overall index value.

democracy_index <- democracy_index |> 
  select(country_name, country_code, democracy_index)

Joining all country-level data together

country_level_data <- codelist |> 
  select(iso3c, iso2c) |> 
  left_join(gdp_data, by = c("iso3c" = "country_code")) |> 
  # left_join(rural_data, by = c("iso3c" = "country_code")) |> rural population data will probably be discarded
  left_join(gender_index, by = c("iso3c" = "country_iso_code")) |> 
  left_join(lgbt_rights_index, by = c("iso3c" = "country_code")) |> 
  left_join(democracy_index, by = c("iso3c" = "country_code")) |> 
  select(-contains("country_")) # removing all duplicated country_name columns that were joined from each data frame
str(country_level_data)
## tibble [28 × 7] (S3: tbl_df/tbl/data.frame)
##  $ iso3c                  : chr [1:28] "AUT" "BEL" "BGR" "HRV" ...
##  $ iso2c                  : chr [1:28] "AT" "BE" "BG" "HR" ...
##  $ gdp_pc_ppp             : num [1:28] 64630 60452 27673 35094 46157 ...
##  $ country                : chr [1:28] "Austria" "Belgium" "Bulgaria" "Croatia" ...
##  $ gender_inequality_index: num [1:28] 0.053 0.048 0.205 0.119 0.235 0.124 0.016 0.092 0.031 0.086 ...
##  $ lgbt_policy_index      : num [1:28] 8.92 10 4.94 6.96 3.95 ...
##  $ democracy_index        : num [1:28] 82.9 76.4 70.3 65.7 75.9 76.9 92.2 79 92.5 81.2 ...

Data cleaning

Selecting relevant variables

We are discarding all variables that relate to trade and globalization (the qa questions), as they are not relevant for our analysis. We are also discarding variables related to energy policies (the qb questions).

We have also excluded the questions specifically about Roma, e.g., qc8, qc14, and qc16.


Some of the doubts we had: this section performs an initial analysis of the variables and their relationship to the target. A value of 1 on the y-axis represents 100% support for transgender people’s right to change their civil documents, while a value of 2 represents 100% opposition. “Don’t know” answers are removed in this section.
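Given this coding, the target could later be recoded into a support dummy for modelling; a minimal sketch, assuming values above 2 in qc19 are “don’t know” answers (the support_trans_docs name is ours):

```r
# 1 = supports changing civil documents, 0 = opposes, NA = don't know
target_df <- data_raw |>
  mutate(support_trans_docs = case_when(
    qc19 == 1 ~ 1,
    qc19 == 2 ~ 0,
    TRUE ~ NA_real_))
```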

AGE

x_df <- data_raw |> filter(qc19<3) |> summarise(mean = mean(qc19), .by = d11)
ggplot(x_df, aes(x = d11, y = mean)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

As the relationship seems fairly linear, we are going to use the continuous variable for age. Variance increases at higher ages due to fewer respondents.


POLITICAL OPINIONS

x_df <- data_raw |> filter(qc19<3) |> summarise(mean = mean(qc19), .by = d1)
ggplot(x_df, aes(x = d1, y = mean)) + geom_point() + geom_smooth() + xlim(1,10) 
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

For political opinions the relationship seems less linear, so we will work with the categorical variable. Indeed, there appears to be a clear group (1 to 4), which matches the grouping used in the three-category variable (left, center, and right), so we will use the three-category variable.


MARRIAGE

x_df <- data_raw |> filter(qc19<3) |> summarise(mean = mean(qc19), .by = d7r1)
ggplot(x_df, aes(x = d7r1, y = mean)) + geom_point() + xlim(1,5) 
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Of all the possible combinations, this one seems best at highlighting differences in our target variable.


NATIONALITY

There is no easy way to create an immigrant dummy by checking whether someone lacks the nationality of the country in which they are being interviewed. So we simply use q1_29 as a dummy for whether someone holds a non-EU nationality.


POLITICAL INTEREST

We use only polintr as a summary of the whole d71 battery, which asks how strong your interest in politics is across various domains.


EXPERIENCED DISCRIMINATION

For the questions about whether you have experienced discrimination (qc2_), we keep only qc2_15, a binary indicator of whether you have experienced any discrimination at all. We do not include the other categories, which indicate the particular motive for the discrimination: these variables have data quality issues, and we can infer that if a person belongs to a specific minority (information from sd2) and is discriminated against, it is due to that minority status (at least in most cases).


PERCEIVED DISCRIMINATION in country

This question (qc1) asks how widespread you perceive discrimination to be in your country. It is a rather peculiar question because it asks individuals for a perception at the country level. Due to the nature of mixed modelling, we have prioritised using the EU survey to create individual-level discrimination measures; country-level measures are captured in the country-level data.

For the same reason, we also discard qc4, which asks whether, in your country, candidates with certain characteristics would be at a disadvantage in the recruitment process. It still essentially asks about perceived discrimination at the country level.

The same reasoning applies to qc7, which asks whether efforts towards reducing discrimination are effective in your country.


HOW DISCRIMINATORY ARE YOU SCORE

We use qc12 and qc13 to build a score from 1 to 10 of how discriminatory a person is against certain minorities. We build the score for each minority and can decide later whether some are irrelevant.

Note that for some categories this score makes little sense (e.g., saying you would feel uncomfortable if your child were in a love relationship with an old person would yield a high discrimination score against old people); these are the kinds of categories for which it is worth excluding the index.

We aggregate these into discrimination scores (e.g., a religious/ethnic score). When required, we can aggregate further to obtain a single score measuring individual discrimination.

qc6 also asks about discriminatory attitudes, but it is about elected public officials and does not concern a situation that affects you personally. It also uses different categories for minorities, so we prefer qc12 and qc13 over qc6.


SUPPORTIVE OF LGBTQ RIGHTS INDEX

Using qc15 we can again build an index of how supportive of LGBTQ rights a person is.

We again choose to take the mean across the different answers, but we could also use the median/mode if we think it fits better.

We could also use qc18 to try to capture anti-LGBTQ sentiment, but we preferred qc15 as it seemed more straightforward (we therefore dropped qc18, but can add it back if useful).


MY VOICE COUNTS

We use this as a proxy for social alienation. We hypothesised that people who feel alienated from society and politics might be less likely to support LGBTQ rights, so we took the mean of the two subquestions in d72, which ask whether you feel your voice matters.


OTHERS

qc11 (willingness to provide information) is a strange question that seems uninformative, so we removed it.

We also removed qc9 and qc17, as it was unclear how to use them.

Here we select our initial variables of interest from the above discussion.

data <- data_raw |> 
  select(serialid, # unique identifier
         isocntry, # 2 digit country code
         d11, # age variables
         q1_29, # nationality of interviewee. options given: EU28+Other+DK, using only other = outside of EU
         d70, #life satisfaction
         polintr, # political interest index (summarizes d71 questions)
         starts_with("sd1"), # friends that are minority groups
         starts_with("sd2"), # are you part of a minority
         sd3, # religion
         qc2_15, # experienced discrimination yourself
         qc3, # where discrimination took place
         starts_with("qc5"), # actions against discrimination
         qc10, # how would you report discrimination
         starts_with("qc12"), # feelings about colleagues being minority
         starts_with("qc13"), # feelings about kid being in a relationship with minority
         starts_with("qc15"), # opinions about lgbtqi 
         qc19, # target variable transgender
         qc20, # non-binary genders in documents
         d1r1, # political ideology
         d7r1, # marital status
         d10, # gender binary
         d8, # education 
         d15a_r2, # current occupation (discarded previous occupation d15b)
         d25, # rural vs urban
         d43t, # phone availability
         d60, # financial stress (paying bills)
         netuse, # internet index
         d63, # social class
         starts_with("d72"), # my voice counts
  )

paradata <- data_raw |> 
  select(serialid, # to match it with the other data
         p2, p3, p3r, p4, p5) # paradata

Removing some of the disaggregated questions

We drop sd2_7 to sd2_10, possible answers deemed irrelevant to the question about being part of the following minorities: sd2_7, other minorities (it is unclear what other relevant minorities there could be); sd2_8, not part of any minority (can be deduced from the rest); sd2_9, refusal to respond; sd2_10, don’t know. We also drop sd2t, the summary binary variable for being part of any minority, since we keep the more specific ones.

We also remove the recoded versions of qc12 (the variables ending in r), as we prefer to work with the full variables rather than the recoded versions with fewer categories. The same goes for qc13 and qc18.

data <- data |> 
  select(-c(sd2_7, sd2_8, sd2_9, sd2_10, sd2t)) |> 
  select(-(starts_with("qc12") & ends_with("r"))) |> 
  select(-(starts_with("qc13") & ends_with("r"))) |> 
  select(-(starts_with("qc18") & ends_with("r")))

Check overall data quality

explore_tbl(data)

plot_intro(data)

All columns are numeric except isocntry:

data |> 
  select(where(~ !is.numeric(.)))
## # A tibble: 27,438 × 1
##    isocntry
##    <chr>   
##  1 BE      
##  2 BE      
##  3 BE      
##  4 BE      
##  5 BE      
##  6 BE      
##  7 BE      
##  8 BE      
##  9 BE      
## 10 BE      
## # ℹ 27,428 more rows

Most columns do not have explicit NAs:

plot_missing(data)

Extract data labels

We exploit the fact that the .dta file has attributes (labels) for all its columns.

If we query attr(colname, "label") we get back the original long name of the variable.

If we instead query attr(colname, "labels") we get back all the possible encoding levels of the variable (e.g., 1, 2, 3, 4), and names(attr(colname, "labels")) returns the actual meaning of those numbers (e.g., 1 = “Yes”, 2 = “No”, etc.).

We will now look into the labels for the different levels that our factor variables can take:

attr(data$d70, "labels")
##       Very satisfied     Fairly satisfied   Not very satisfied 
##                    1                    2                    3 
## Not at all satisfied                   DK 
##                    4                    5
names(attr(data$d70, "labels"))
## [1] "Very satisfied"       "Fairly satisfied"     "Not very satisfied"  
## [4] "Not at all satisfied" "DK"
tibble(name_labels = names(attr(data$d70, "labels")),
       labels = attr(data$d70, "labels"))
## # A tibble: 5 × 2
##   name_labels          labels
##   <chr>                 <dbl>
## 1 Very satisfied            1
## 2 Fairly satisfied          2
## 3 Not very satisfied        3
## 4 Not at all satisfied      4
## 5 DK                        5
# Create a list of tibbles containing the labels and their associated name for each variable
list_label_tibbles <- 
  #Applies a function across all columns of a df and returns results as a list
  lapply(names(data), function(col_name) {
    labels <- attr(data[[col_name]], "labels")  # Extract labels
    name_labels <- names(labels)  # Extract label names
    # Create tibble with the extracted data only if labels exist
    if (!is.null(labels)) {
      tibble(name_labels = name_labels, labels = labels)} 
    else {NULL}  # Returns a NULL element for columns without labels
  })

# Giving to each element of the list as name the name of the variable
list_label_tibbles <- setNames(list_label_tibbles, names(data))

# For example
list_label_tibbles$d70
## # A tibble: 5 × 2
##   name_labels          labels
##   <chr>                 <dbl>
## 1 Very satisfied            1
## 2 Fairly satisfied          2
## 3 Not very satisfied        3
## 4 Not at all satisfied      4
## 5 DK                        5

The right column is what appears in our data (as a number). The left column is the label that we must assign to that number when we factorize.

Cleaning

Initial cleaning and pre-processing

Recoding East and West Germany together, since we run the analysis at the country level and they are coded separately.

unique(data$isocntry)
##  [1] "BE"   "DK"   "GR"   "ES"   "FI"   "FR"   "IE"   "IT"   "LU"   "NL"  
## [11] "AT"   "PT"   "SE"   "DE-W" "DE-E" "GB"   "BG"   "CY"   "CZ"   "EE"  
## [21] "HU"   "LV"   "LT"   "MT"   "PL"   "RO"   "SK"   "SI"   "HR"
data <- data |> 
  mutate(isocntry = case_when(
    isocntry %in% c("DE-W", "DE-E") ~ "DE",
    TRUE ~ isocntry))
# We might want to join the full name of the countries using the codelist df


Recode all factor NAs

We separate the variables for which we can use the attribute labels to factorize them from those for which this strategy cannot be used.

data <- data |> 
  rename(friends_trans = sd1_7)

non_factor_variables <- c("serialid", "tnscntry", "isocntry", "d11", "q1_29", 
                          names(data)[startsWith(names(data), "sd1_")],
                          names(data)[startsWith(names(data), "sd2_")],
                          names(data)[startsWith(names(data), "qc5_")],
                          names(data)[startsWith(names(data), "qc12_")],
                          names(data)[startsWith(names(data), "qc13_")],
                          names(data)[startsWith(names(data), "qc15_")],
                          names(data)[startsWith(names(data), "d72_")],
                          "d8", "opls")
factor_variables <- setdiff(names(data), non_factor_variables)

Correctly encoding the factor variables that do not need any further cleaning

# Converting them to factors and assign them their labels automatically
data <- data |> 
  mutate(across(all_of(factor_variables), labelled::to_factor))

# Turning DK into NAs for all the factor variables
data <- data |> 
  mutate(across(all_of(factor_variables), ~ fct_na_level_to_value(., extra_levels = "DK")))

# Converting to numeric the variables that should be numeric
data <- data |> 
  mutate(age = as.numeric(d11),
         years_edu = as.numeric(d8)) |> 
  # Recoding the d8 education variable correctly, according to unique(data_raw$d8)
  mutate(years_edu = case_when(
    years_edu %in% c(0, 99) ~ NA, # Refusal and DK as NAs
    years_edu == 97 ~ 0, # No full time education = 0
    years_edu == 98 ~ age, # still studying = age
    TRUE ~ years_edu)) |> 
  select(-c(d11, d8))

# Converting q1_29 to factor without assigning labels (1 if non-EU nationality, 0 if EU nationality)
# Same for sd2_
data <- data |> 
  mutate(nonEU_national = as.factor(q1_29),
         across(starts_with("sd2_"), ~ as.factor(.x))) |> 
  select(-q1_29)


Create new individual level measures

Dealing with the other variables on which we do some further pre-processing (feature engineering).

Creating a variable that counts the number of different minority groups a person has acquaintances with:

data <- data |>  
  mutate(across(starts_with("sd1"), ~ if_else(.x == 1, 1, 0))) |> 
  mutate(n_friends_minorities = sd1_1+sd1_2+sd1_3+sd1_4+sd1_5+sd1_6+sd1_8) |> 
  relocate(n_friends_minorities, .before="sd1_1") |> 
  select(-starts_with("sd1"))

Creating a variable that counts the number of actions against discrimination that you have taken in the last year:

data <- data |>  
  mutate(across(starts_with("qc5"), ~ if_else(.x == 1, 1, 0))) |> 
  mutate(n_actions_against_discri = qc5_1+qc5_2+qc5_3+qc5_4) |> 
  relocate(n_actions_against_discri, .after="qc3") |> 
  select(-starts_with("qc5"))

Building a discriminatory score:

data <- data |> 
  # Coding as NAs "it depends" and "DK"
  mutate(across(starts_with("qc12"), ~ if_else(.x >= 12, NA, .x))) |> 
  mutate(across(starts_with("qc13"), ~ if_else(.x >= 12, NA, .x))) |> 
  # Coding as 5 responses = indifferent
  mutate(across(starts_with("qc12"), ~ if_else(.x == 11, 5, .x))) |> 
  mutate(across(starts_with("qc13"), ~ if_else(.x == 11, 5, .x))) |> 
  # Modifying such that higher is more discriminatory
  mutate(roma_discri = 11 - rowMeans(cbind(qc12_1, qc13_1), na.rm = TRUE),
         black_discri = 11 - rowMeans(cbind(qc12_2, qc13_2), na.rm = TRUE),
         asian_discri = 11 - rowMeans(cbind(qc12_3, qc13_3), na.rm = TRUE),
         white_discri = 11 - rowMeans(cbind(qc12_4, qc13_4), na.rm = TRUE),
         jewish_discri = 11 - rowMeans(cbind(qc12_5, qc13_5), na.rm = TRUE),
         muslim_discri = 11 - rowMeans(cbind(qc12_6, qc13_6), na.rm = TRUE),
         buddihst_discri = 11 - rowMeans(cbind(qc12_7, qc13_7), na.rm = TRUE),
         christian_discri = 11 - rowMeans(cbind(qc12_8, qc13_8), na.rm = TRUE),
         atheist_discri = 11 - rowMeans(cbind(qc12_9, qc13_9), na.rm = TRUE),
         lgb_discri = 11 - rowMeans(cbind(qc12_10, qc13_10), na.rm = TRUE),
         trans_discri = 11 - rowMeans(cbind(qc12_11, qc13_11), na.rm = TRUE),
         intersex_discri = 11 - rowMeans(cbind(qc12_12, qc13_12), na.rm = TRUE),
         disability_discri = 11 - rowMeans(cbind(qc12_13, qc13_13), na.rm = TRUE),
         young_discri = 11 - rowMeans(cbind(qc12_14, qc13_14), na.rm = TRUE),
         old_discri = 11 - rowMeans(cbind(qc12_15, qc13_15), na.rm = TRUE)) |> 
  select(-starts_with("qc12")) |> 
  select(-starts_with("qc13"))

We delete the discrimination indexes against young and old people. This is because one of the questions asks how comfortable you would feel if one of your children were in a love relationship with a person from group x. We don’t believe these are relevant to our target or to the discrimination score.

data <- data |> 
  select(-c("old_discri", "young_discri"))
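If we later want the single aggregate mentioned earlier, the per-group scores can be averaged into one overall measure; a hedged sketch (the overall_discri name is ours):

```r
# One overall discrimination score as the row mean of the remaining *_discri columns
data |>
  mutate(overall_discri = rowMeans(across(ends_with("_discri")), na.rm = TRUE)) |>
  select(serialid, overall_discri)
```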

Building a score of how supportive of, or opposed to, LGBTQ rights a person is:

data <- data |> 
  mutate(across(starts_with("qc15"), ~ if_else(.x == 5, NA, .x))) |>
  mutate(antilgbtq_rights = round(rowMeans(cbind(qc15_1, qc15_2, qc15_3), na.rm = TRUE), 2)) |> 
  select(-starts_with("qc15"))
# Scale of 1 to 4, 1 = supportive, 4 = homophobic

Aggregating together questions on whether you think your voice counts:

data <- data |> 
  mutate(across(starts_with("d72"), ~ if_else(.x > 4, NA, .x))) |>
  mutate(social_alienation = rowMeans(cbind(d72_1, d72_2), na.rm = TRUE)) |> 
  select(-starts_with("d72"))
# The higher the more people think their voice does not matter

This should be our final selection of variables.

We still need to rename them appropriately and check that all the NAs are correctly encoded by looking at the summary.
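To check the NA encoding, a quick per-column missingness count can complement the summary; a hedged sketch:

```r
# Number of NAs per column, most missing first
data |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))
```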

Further cleaning

Rename columns appropriately

data <- data |> 
  rename(
    country = isocntry,
    life_sat = d70,
    ethnic_minority = sd2_1,
    skincolor_minority = sd2_2,
    religious_minority = sd2_3,
    roma_minority = sd2_4,
    sexual_minority = sd2_5,
    disability_minority = sd2_6,
    religion = sd3,
    disc = qc2_15,
    disc_where = qc3,
    disc_contact = qc10,
    trans_docs = qc19,
    gender_docs = qc20,
    left_right = d1r1,
    marital_status = d7r1,
    gender = d10,
    occupation = d15a_r2,
    community = d25,
    phone_access = d43t,
    bill_issues = d60,
    internet_use = netuse,
    social_class = d63
  ) 

Now we will transform the variable disc, which stores whether you have been subject to discrimination. It was originally coded in a negative direction (1 = “Not mentioned”, 2 = “No, you haven’t been discriminated against”), so we recode it into a positive dummy variable to make its interpretation more straightforward. The new variable equals 1 if a person has been subject to discrimination:

data <- data |> 
  mutate(suffered_discr = as.factor(ifelse(disc == "Not mentioned", 1, 0))) |> 
  select(-disc) |> 
  relocate(suffered_discr, .before = disc_where)

Move the variables around to get an ordered dataframe, and assign labels to help interpretation:

# Relocating to have an ordered df
data <- data |> 
  relocate(c("age", "gender", "years_edu","community", "marital_status", "occupation", "social_class", "religion", "nonEU_national", "phone_access", "bill_issues", "internet_use"), .after = country) |> 
  relocate(c("left_right", "social_alienation"), .after = polintr) |> 
  relocate(c("friends_trans", "n_friends_minorities", "n_actions_against_discri"), .after = gender_docs)

# Assigning labels to columns which have a difficult meaning
attr(data$gender, "label") <- NULL
attr(data$nonEU_national, "label") <- "OWNS A NON-EU NATIONALITY"
attr(data$social_alienation, "label") <- "HIGHER -> THINK THEIR VOICE DOESN'T MATTER"
attr(data$ethnic_minority, "label") <- "ARE YOU PART OF X MINORITY" 
attr(data$skincolor_minority, "label") <- "ARE YOU PART OF X MINORITY" 
attr(data$religious_minority, "label") <- "ARE YOU PART OF X MINORITY" 
attr(data$roma_minority, "label") <- "ARE YOU PART OF X MINORITY" 
attr(data$sexual_minority, "label") <- "ARE YOU PART OF X MINORITY" 
attr(data$disability_minority, "label") <- "ARE YOU PART OF X MINORITY" 
attr(data$suffered_discr, "label") <- "HAVE YOU BEEN SUBJECT TO DISCRIMINATION"
attr(data$n_friends_minorities, "label") <- "YOU KNOW PEOPLE FROM # NUMBER OF DIFFERENT MINORITIES"
attr(data$n_actions_against_discri, "label") <- "YOU HAVE TAKEN # NUMBER OF DIFFERENT ACTIONS TO FIGHT DISCRIMINATION"
attr(data$roma_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$black_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$asian_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$white_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$jewish_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$muslim_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$buddihst_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$christian_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$atheist_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$lgb_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$trans_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$intersex_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$disability_discri, "label") <- "HOW DISCRIMINATORY ARE YOU AGAINST X"
attr(data$antilgbtq_rights, "label") <- "HIGHER -> THEY OPPOSE MORE RIGHTS TO LGBTQ"

This is our dataset so far:

summary(data)
##     serialid       country               age          gender     
##  Min.   :    1   Length:27438       Min.   :15.00   Man  :12492  
##  1st Qu.: 6860   Class :character   1st Qu.:37.00   Woman:14946  
##  Median :13720   Mode  :character   Median :53.00                
##  Mean   :13720                      Mean   :51.56                
##  3rd Qu.:20579                      3rd Qu.:66.00                
##  Max.   :27438                      Max.   :98.00                
##                                                                  
##    years_edu                          community    
##  Min.   : 0.00   Rural area or village     : 8776  
##  1st Qu.:17.00   Small or middle sized town:10767  
##  Median :18.00   Large town                : 7881  
##  Mean   :19.66   NA's                      :   14  
##  3rd Qu.:22.00                                     
##  Max.   :90.00                                     
##  NA's   :387                                       
##                                 marital_status 
##  (Re-)Married (1-4 in d7)              :14673  
##  Single living with partner (5-8 in d7): 3321  
##  Single (9-10 in d7)                   : 4314  
##  Divorced or separated (11-12 in d7)   : 2243  
##  Widow (13-14 in d7)                   : 2712  
##  Other (SPONT.)                        :  107  
##  Refusal (SPONT.)                      :   68  
##                                   occupation  
##  Retired (4 in d15a)                   :8791  
##  Manual workers (15 to 18 in d15a)     :5883  
##  Other white collars (13 or 14 in d15a):3536  
##  Managers (10 to 12 in d15a)           :2911  
##  Self-employed (5 to 9 in d15a)        :1979  
##  Students (2 in d15a)                  :1676  
##  (Other)                               :2662  
##                             social_class                       religion    
##  The middle class of society      :13068   Catholic                :11198  
##  The working class of society     : 7233   Orthodox Christian      : 4016  
##  The lower middle class of society: 4070   Non believer or agnostic: 3694  
##  The upper middle class of society: 1889   Protestant              : 3031  
##  None (SPONTANEOUS)               :  229   Atheist                 : 2109  
##  (Other)                          :  389   (Other)                 : 3220  
##  NA's                             :  560   NA's                    :  170  
##  nonEU_national            phone_access               bill_issues   
##  0:27316        Mobile only      :14555   Most of the time  : 2054  
##  1:  122        Landline only    :  802   From time to time : 6538  
##                 Landline & mobile:11576   Almost never/never:18467  
##                 No telephone     :  505   Refusal (SPONT.)  :  379  
##                                                                     
##                                                                     
##                                                                     
##                      internet_use                   life_sat    
##  Everyday/Almost everyday  :19900   Very satisfied      : 7242  
##  Two or three times a week : 1701   Fairly satisfied    :15356  
##  About once a week         :  434   Not very satisfied  : 3736  
##  Two or three times a month:  164   Not at all satisfied: 1002  
##  Less often                :  350   NA's                :  102  
##  Never/No access           : 4311                               
##  No Internet access at all :  578                               
##        polintr               left_right   social_alienation ethnic_minority
##  Strong    : 4644   (1 - 4) Left  :7082   Min.   :1.000     0:26603        
##  Medium    :13701   (5 - 6) Centre:9313   1st Qu.:1.500     1:  835        
##  Low       : 4387   (7 -10) Right :6354   Median :2.000                    
##  Not at all: 4706   DK/Refusal    :4689   Mean   :2.329                    
##                                           3rd Qu.:3.000                    
##                                           Max.   :4.000                    
##                                           NA's   :1000                     
##  skincolor_minority religious_minority roma_minority sexual_minority
##  0:26923            0:26416            0:27001       0:27006        
##  1:  515            1: 1022            1:  437       1:  432        
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##                                                                     
##  disability_minority suffered_discr
##  0:26756             0:23150       
##  1:  682             1: 4288       
##                                    
##                                    
##                                    
##                                    
##                                    
##                                                           disc_where   
##  In a public space                                             :  893  
##  At work                                                       :  788  
##  When looking for a job                                        :  642  
##  At a café, restaurant, bar or nightclub                       :  322  
##  By healthcare personnel (e.g. a receptionist, nurse or doctor):  308  
##  (Other)                                                       : 1054  
##  NA's                                                          :23431  
##                                                                       disc_contact 
##  The police                                                                 :8769  
##  A friend or family member                                                  :5386  
##  An equalities body or ombudsman (SPECIFY THE NAME ACCORDING TO THE COUNTRY):3795  
##  A lawyer                                                                   :2329  
##  Courts                                                                     :1225  
##  (Other)                                                                    :3601  
##  NA's                                                                       :2333  
##  trans_docs   gender_docs                friends_trans   n_friends_minorities
##  Yes :14463   Yes :10856   Yes                  : 2644   Min.   :0.000       
##  No  : 9695   No  :13395   No                   :23520   1st Qu.:1.000       
##  NA's: 3280   NA's: 3187   Refusal (SPONTANEOUS):  306   Median :3.000       
##                            NA's                 :  968   Mean   :3.029       
##                                                          3rd Qu.:5.000       
##                                                          Max.   :7.000       
##                                                                              
##  n_actions_against_discri  roma_discri      black_discri     asian_discri   
##  Min.   :0.0000           Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:0.0000           1st Qu.: 1.500   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median :0.0000           Median : 4.500   Median : 3.000   Median : 2.500  
##  Mean   :0.3855           Mean   : 4.631   Mean   : 3.579   Mean   : 3.387  
##  3rd Qu.:0.0000           3rd Qu.: 7.000   3rd Qu.: 5.500   3rd Qu.: 5.500  
##  Max.   :4.0000           Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                           NA's   :757      NA's   :542      NA's   :592     
##   white_discri    jewish_discri   muslim_discri    buddihst_discri 
##  Min.   : 1.000   Min.   : 1.00   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 1.00   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 1.000   Median : 2.00   Median : 4.000   Median : 3.000  
##  Mean   : 1.785   Mean   : 3.18   Mean   : 4.363   Mean   : 3.576  
##  3rd Qu.: 2.000   3rd Qu.: 5.00   3rd Qu.: 6.500   3rd Qu.: 5.500  
##  Max.   :10.000   Max.   :10.00   Max.   :10.000   Max.   :10.000  
##  NA's   :403      NA's   :603     NA's   :662      NA's   :761     
##  christian_discri atheist_discri     lgb_discri      trans_discri   
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 2.000  
##  Median : 1.000   Median : 1.500   Median : 4.000   Median : 5.000  
##  Mean   : 1.904   Mean   : 2.802   Mean   : 4.399   Mean   : 4.934  
##  3rd Qu.: 2.000   3rd Qu.: 4.000   3rd Qu.: 6.500   3rd Qu.: 7.500  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##  NA's   :414      NA's   :517      NA's   :536      NA's   :1081    
##  intersex_discri  disability_discri antilgbtq_rights
##  Min.   : 1.000   Min.   : 1.000    Min.   :1.000   
##  1st Qu.: 1.500   1st Qu.: 1.000    1st Qu.:1.000   
##  Median : 5.000   Median : 2.500    Median :2.000   
##  Mean   : 4.825   Mean   : 2.977    Mean   :2.147   
##  3rd Qu.: 7.500   3rd Qu.: 4.500    3rd Qu.:3.000   
##  Max.   :10.000   Max.   :10.000    Max.   :4.000   
##  NA's   :1331     NA's   :585       NA's   :791

Our dataset is reasonably balanced across countries: most countries contribute roughly 1,000 respondents, with Cyprus (503) a notable exception and Germany (1,537) the largest sample.

data |> count(country)
## # A tibble: 28 × 2
##    country     n
##    <chr>   <int>
##  1 AT       1027
##  2 BE       1028
##  3 BG       1032
##  4 CY        503
##  5 CZ       1008
##  6 DE       1537
##  7 DK       1004
##  8 EE       1003
##  9 ES       1005
## 10 FI       1003
## # ℹ 18 more rows

Although some levels of occupation have more observations than others, each level has enough observations, so we do not need to aggregate categories for this variable.

data |> count(occupation)
## # A tibble: 8 × 2
##   occupation                                 n
##   <fct>                                  <int>
## 1 Self-employed (5 to 9 in d15a)          1979
## 2 Managers (10 to 12 in d15a)             2911
## 3 Other white collars (13 or 14 in d15a)  3536
## 4 Manual workers (15 to 18 in d15a)       5883
## 5 House persons (1 in d15a)               1358
## 6 Unemployed (3 in d15a)                  1304
## 7 Retired (4 in d15a)                     8791
## 8 Students (2 in d15a)                    1676

Very few people declare themselves to be part of the higher class of society; nonetheless, we keep this level because it is substantively meaningful.

data |> count(social_class)
## # A tibble: 9 × 2
##   social_class                          n
##   <fct>                             <int>
## 1 The working class of society       7233
## 2 The lower middle class of society  4070
## 3 The middle class of society       13068
## 4 The upper middle class of society  1889
## 5 The higher class of society         157
## 6 Other (SPONTANEOUS)                  59
## 7 None (SPONTANEOUS)                  229
## 8 Refusal (SPONTANEOUS)               173
## 9 <NA>                                560

Given the low number of observations in some categories of religion and its high number of categories, we opt to aggregate some of them; otherwise we risk introducing too much noise into our models.

data |> count(religion)
## # A tibble: 16 × 2
##    religion                     n
##    <fct>                    <int>
##  1 Catholic                 11198
##  2 Orthodox Christian        4016
##  3 Protestant                3031
##  4 Other Christian           1183
##  5 Jewish                      58
##  6 Muslim - Shia               77
##  7 Muslim - Sunni             178
##  8 Other Muslim               137
##  9 Sikh                        13
## 10 Buddhist                    58
## 11 Hindu                       32
## 12 Atheist                   2109
## 13 Non believer or agnostic  3694
## 14 Other                     1171
## 15 Refusal (SPONTANEOUS)      313
## 16 <NA>                       170
# Checking if the means of groups we are going to aggregate are similar
data |> 
  mutate(trans_docs=as.numeric(trans_docs)) |> 
  summarise(mean = mean(trans_docs, na.rm=TRUE), .by = religion) |>
  arrange(mean)
## # A tibble: 16 × 2
##    religion                  mean
##    <fct>                    <dbl>
##  1 Sikh                      1.25
##  2 Atheist                   1.27
##  3 Non believer or agnostic  1.28
##  4 Protestant                1.30
##  5 Buddhist                  1.33
##  6 Other                     1.37
##  7 Jewish                    1.38
##  8 Other Christian           1.43
##  9 Catholic                  1.43
## 10 Hindu                     1.44
## 11 Muslim - Shia             1.45
## 12 <NA>                      1.45
## 13 Other Muslim              1.49
## 14 Refusal (SPONTANEOUS)     1.49
## 15 Muslim - Sunni            1.54
## 16 Orthodox Christian        1.58

We group Atheists together with agnostics. We also move Sikh, Buddhist, Jewish and Hindu respondents into the Other category, as each has very few observations. Finally, we group all Muslim denominations together.

data <- data |> 
  mutate(religion = fct_collapse(religion,
                                "Non-believers" = c("Atheist", "Non believer or agnostic"),
                                "Other" = c("Sikh", "Buddhist", "Jewish", "Hindu", "Other"),
                                "Muslim" = c("Muslim - Shia", "Muslim - Sunni", "Other Muslim")))

#This is what we end up with
data |> count(religion)
## # A tibble: 9 × 2
##   religion                  n
##   <fct>                 <int>
## 1 Catholic              11198
## 2 Orthodox Christian     4016
## 3 Protestant             3031
## 4 Other Christian        1183
## 5 Other                  1332
## 6 Muslim                  392
## 7 Non-believers          5803
## 8 Refusal (SPONTANEOUS)   313
## 9 <NA>                    170

Recoding missing values

In this section we convert all redundant factor levels to NAs.

First, we recode the "Refusal" answers in the remaining variables as NAs: if a respondent refuses to answer, we treat it as equivalent to not knowing their answer. We exclude friends_trans because, for that variable, refusals could be worth analyzing in their own right.

factor_variables <- names(data)[sapply(data, is.factor)]

data <- data %>%
  mutate(across(
    all_of(setdiff(factor_variables, "friends_trans")),  # exclude "friends_trans"
    ~ {
      # Find levels that contain "refusal" (case-insensitive)
      refusal_levels <- grep("REFUSAL", levels(.), value = TRUE, ignore.case = TRUE)
      if (length(refusal_levels) > 0) {
        # Convert those levels to NA values (handles one or several levels)
        fct_na_level_to_value(., refusal_levels)
      } else {
        # Otherwise leave the factor unchanged
        .
      }
    }
  ))

We recode social_class's "None (SPONTANEOUS)" level to NA.

We also turn marital_status's "Other (SPONT.)" level into NA, as it is difficult to give it any substantive meaning, and do the same for social_class's "Other (SPONTANEOUS)" level.

data <- data %>%
  mutate(social_class = fct_recode(social_class, NULL = "None (SPONTANEOUS)"))

data <- data %>%
  mutate(marital_status = fct_recode(marital_status, NULL = "Other (SPONT.)"),
         social_class = fct_recode(social_class, NULL = "Other (SPONTANEOUS)"))
plot_intro(data)

plot_missing(data)

The total share of missing observations is quite low (4%), so missing data should not be a huge problem.

Given its high percentage of missing values, we delete disc_where. High missingness was expected here, as the question applies only to people who have been subject to discrimination, so the variable is not of great importance.

The variable with the second-highest share of missing data is left_right; given its importance, it is probably worth imputing those values.

Next come trans_docs and gender_docs, which are, respectively, our target variable and a closely related variable.

Then there is disc_contact, which we delete as well, given that it does not add meaningful information to our analysis.

data <- data |> 
  select(-c("disc_where", "disc_contact"))
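Since left_right is flagged above as worth imputing, here is a minimal sketch of how the already-loaded mice package handles such missingness. This uses toy numeric data; the column names and settings are illustrative placeholders, not the imputation model applied later in the analysis.

```r
# Illustrative multiple imputation with mice on toy data
# (assumption: columns and settings here are placeholders)
library(mice)
set.seed(1)
toy <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, NA, 8, 9, 10),
  y = c(2, 4, 6, NA, 10, 12, 14, 16, NA, 20)
)
imp <- mice(toy, m = 5, printFlag = FALSE)  # 5 imputed datasets, default pmm
completed <- complete(imp, 1)               # extract the first completed dataset
sum(is.na(completed))                       # all missing values are filled in
```

In practice one would analyse all m completed datasets and pool the results, rather than work with a single completed copy.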

Exploratory Data Analysis

Plotting the distribution of the numeric variables

# Identify numeric variables
numeric_vars <- names(data)[sapply(data, is.numeric)]
numeric_vars <- numeric_vars[numeric_vars != "serialid"]

# Creating a histogram for each numeric variable
for (var in numeric_vars) {
  p <- ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(binwidth = 1, fill = "blue", color = "black") +
    labs(title = paste("Histogram of", var), x = var, y = "Count") +
    theme_minimal()
  print(p)}
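As a complement to the histograms, skewness can be quantified numerically with the already-loaded e1071 package. A minimal sketch on a toy vector; in the report this would be applied to each column in numeric_vars.

```r
# Numeric skewness via e1071 (toy vector standing in for a column of `data`)
library(e1071)
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 20)  # right-skewed toy data
skewness(x, na.rm = TRUE)              # positive value indicates right skew
```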

Analysis of individual-level variables

We must keep in mind our target variable is qc19 “Do you think that transgender persons should be able to change their civil documents to match their inner gender identity?”

data_percent <- data |>
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |> 
  group_by(trans_docs) |> 
  summarise(count = n()) |> 
  mutate(percentage = count / sum(count) * 100)

ggplot(data_percent, aes(x = trans_docs, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity") +  
  geom_text(aes(label = sprintf("%.1f%%", percentage)),  
            vjust = -0.5, size = 4, color = "black") +  
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + 
  labs(
    title = "Support for the right of trans people to change their gender in civil documents",
    x = "Support for Transgender Rights",
    y = "Percentage"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Looking at our variable of interest alone, we see that 52.7% of our sample are in favor of trans people being able to change their gender in their civil documents. However, there is also significant opposition (35.3%), and 12% answered "Don't know". We will explore this distribution along the variables we consider most important for our analysis.

Sociodemographic variables

Gender

data_summary <- data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |> 
  count(trans_docs, gender, name = "n") |> 
  group_by(gender) |>  
  mutate(percentage = n / sum(n))

ggplot(data_summary, aes(x = gender, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change their gender in civil documents",
    x = "Gender",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

# Create contingency table
contingency_table <- table(data$gender, data$trans_docs)

# Calculate chisquare test
chisq_test <- chisq.test(contingency_table)
print(chisq_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 92.715, df = 1, p-value < 2.2e-16

Men exhibit a lower percentage of favorable or "don't know" responses and a higher rate of rejection compared to women. There is a statistically significant association between gender and our target variable, with women being more supportive.
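To complement the significance test, an effect-size measure such as Cramér's V (available in the already-loaded DescTools package) quantifies the strength of the association. A sketch on a toy 2x2 table; in the report this would be the gender x trans_docs contingency table.

```r
# Effect size of a 2x2 association via Cramér's V
# (toy counts standing in for table(data$gender, data$trans_docs))
library(DescTools)
tab <- matrix(c(5200, 7300, 4800, 5100), nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Man", "Woman"),
                              trans_docs = c("Yes", "No")))
CramerV(tab)  # values close to 0 indicate a weak association
```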

Age

data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  mutate(age_bin = cut(age, breaks = seq(min(age, na.rm = TRUE), max(age, na.rm = TRUE), by = 10), include.lowest = TRUE)) |> 
  # There are 4 observations above 95 y.o. that do not fall within any of the previously defined bins. Given their low number we drop them
  drop_na(age_bin) |> 
  count(age_bin, trans_docs) |> 
  group_by(age_bin) |> 
  mutate(percentage = n / sum(n) * 100) |>  
  ggplot(aes(x = age_bin, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "stack") +
  coord_flip() +
  labs(x = "Age Bins", y = "Proportion of responses", fill = "Support the right",
       title = "Support for the right of trans people to change their gender in civil documents") +
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  theme_minimal() +
   theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Younger people seem more likely to support the right for transgender people to have their gender changed on official documents.

We also notice a large increase in non-responses among older people, meaning older respondents are less likely to answer this question (perhaps due to unfamiliarity with the topic).

So how would that distribution look if we removed those non-responses?

data |> 
  drop_na(trans_docs) |> 
  mutate(age_bin = cut(age, breaks = seq(min(age, na.rm = TRUE), max(age, na.rm = TRUE), by = 10), include.lowest = TRUE)) |> 
  # There are 4 observations above 95 y.o. that do not fall within any of the previously defined bins. Given their low number we drop them
  drop_na(age_bin) |> 
  count(age_bin, trans_docs) |> 
  group_by(age_bin) |> 
  mutate(percentage = n / sum(n) * 100) |>  
  ggplot(aes(x = age_bin, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "stack") +
  coord_flip() +
  labs(x = "Age Bins", y = "Proportion of responses", fill = "Support the right",
       title = "Support for the right of trans people to change their gender in civil documents") +
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  theme_minimal() +
   theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

We see that differences among age groups are not so evident anymore.
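The apparent age gradient could also be tested formally with base R's `prop.trend.test`. A sketch on toy counts; in the report the counts per bin would come from the `count(age_bin, trans_docs)` step above.

```r
# Chi-squared test for a linear trend in proportions across ordered age bins
# (toy counts standing in for the real age_bin aggregation)
yes_counts <- c(620, 560, 500, 430)       # "Yes" answers per age bin
totals     <- c(1000, 1000, 1000, 1000)   # respondents per age bin
prop.trend.test(yes_counts, totals)       # small p-value suggests a trend
```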

Religiosity

We have two variables related to religiosity: religion (the religious affiliation professed by the respondent) and religious_minority (whether or not the respondent belongs to a religious minority group).

data_summary <- data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  mutate(religious_minority = fct_recode(religious_minority, "Yes" = "1", "No" = "0")) |> 
  count(trans_docs, religion, name = "n") |> 
  group_by(religion) |>  
  mutate(percentage = n / sum(n))

ggplot(data_summary, aes(x = reorder(religion, ifelse(trans_docs == "Yes", -percentage, 0)), y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change their gender in civil documents",
    x = "Religious affiliation",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

# Create contingency table
contingency_table <- table(data$religion, data$trans_docs)

# Calculate chisquare test
chisq_test <- chisq.test(contingency_table)
print(chisq_test)
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 973.3, df = 6, p-value < 2.2e-16

The small p-value confirms that the association between religion and trans_docs is statistically significant. Non-believers are the most likely to support the rights, while Orthodox Christians and Muslims are the most likely to oppose them; these groups are also the least likely to answer "Don't know".

We then repeat the analysis for the religious minority question (sd2_2), which yields less clear results.

data_summary <- data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  mutate(religious_minority = fct_recode(religious_minority, "Yes" = "1", "No" = "0")) |> 
  count(trans_docs, religious_minority, name = "n") |> 
  group_by(religious_minority) |>  
  mutate(percentage = n / sum(n))  

ggplot(data_summary, aes(x = religious_minority, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change their gender in civil documents",
    x = "Are you part of a religious minority?",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Ideology

data_summary <- data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  count(trans_docs, left_right, name = "n") |> 
  group_by(left_right) |>  
  mutate(percentage = n / sum(n))  

ggplot(data_summary, aes(x = left_right, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change their gender in civil documents",
    x = "Political affiliation",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Support for trans rights is strongest among left-wing respondents. Right-wing respondents are the most divided, with high levels of opposition. Non-responses are highest among those without an ideological alignment and among right-leaning respondents.

Social class

data_summary <- data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  count(trans_docs, social_class, name = "n") |> 
  group_by(social_class) |>  
  mutate(percentage = n / sum(n))  

ggplot(data_summary, aes(x = social_class, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change \ntheir gender in civil documents",
    x = "You identify as:",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Although D63 of the questionnaire measures self-perceived social class (of oneself and one's household), it is still a relevant and seemingly divisive variable. We observe a positive association: support grows with (perceived) social class, while explicit opposition and non-response are higher among lower-class respondents.

Education

# limit to main age ranges in age completed
data |> 
  mutate(main_educ = ifelse(between(years_edu, 15, 30), 1, 0)) |> 
  group_by(main_educ) |> 
  summarise(count = n(),
            prop = count/nrow(data)*100)
## # A tibble: 3 × 3
##   main_educ count  prop
##       <dbl> <int> <dbl>
## 1         0  3188 11.6 
## 2         1 23863 87.0 
## 3        NA   387  1.41
data |> 
  filter(between(years_edu, 15, 30)) |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  count(trans_docs, years_edu, name = "n") |> 
  group_by(years_edu) |>  
  mutate(percentage = n / sum(n)) |> 
  ggplot(aes(x = years_edu, y = percentage, fill = trans_docs)) +
      geom_bar(stat = "identity", position = "fill") +
    geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 3) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Distribution of support by education",
    x = "Age completed education, from 15 to 30 years old",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  theme_minimal()+
  theme(
    panel.grid = element_blank())

Limiting the data to the main range of education-completion ages, which covers around 87% of observations, we see a less clear relationship between the age at which respondents finished education and support. However, those who finished education before age 20 tend to show higher opposition.

Area of residence

data_summary <- data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  count(trans_docs, community, name = "n") |> 
  group_by(community) |>  
  mutate(percentage = n / sum(n))  

ggplot(data_summary, aes(x = community, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change their \ngender in civil documents",
    x = "Place of residence",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Support does not vary much by area of residence.

Other individual-level variables

Having trans friends

data_summary <- data |> 
  mutate(trans_docs = fct_na_value_to_level(trans_docs, "Don't Know")) |>
  count(trans_docs, friends_trans, name = "n") |> 
  group_by(friends_trans) |>  
  mutate(percentage = n / sum(n))  

ggplot(data_summary, aes(x = friends_trans, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change their \ngender in civil documents",
    x = "Do you know a transgender person?",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Having trans friends is associated with higher support, while those who refuse to disclose whether they know a trans person tend to be more opposed.

Country-level EDA

country_support <- data %>%
  mutate(
    trans_docs = case_when(
      is.na(trans_docs) ~ "DK",
      trans_docs == "Yes" ~ "1",
      trans_docs == "No" ~ "2",
      TRUE ~ as.character(trans_docs)
    )
  ) %>%
  group_by(country, trans_docs) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(country) %>%
  mutate(proportion = (count / sum(count))*100) %>%
  ungroup()
trans_docs_prop <- country_support %>%
  filter(trans_docs == "1") %>%
  select(country, proportion) %>%
  rename(trans_support = proportion)

country_long <- country_level_data %>%
  inner_join(trans_docs_prop, by = c("iso2c" = "country")) |> 
  pivot_longer(cols = c(gdp_pc_ppp, gender_inequality_index, lgbt_policy_index, democracy_index),
    names_to = "variable", values_to = "value")

ggplot(country_long, aes(x = value, y = trans_support)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgray") + # we can remove this
  geom_point(size = 2, color = "blue", alpha = 0.7) +
  geom_text(aes(label = iso3c), vjust = -0.5, hjust = 0.5, size = 3) +
  facet_wrap(~variable, scales = "free_x") +
  labs(
    title = "Relationship between country-level variables and trans support",
    x = "Country-level variable",
    y = "Proportion of Yes answers"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

The Democracy and LGBT Policy indices show a positive linear relationship with the target variable, while the Gender Inequality Index shows a negative one. The relationship with GDP per capita looks non-linear (possibly logarithmic): there appears to be a "ceiling" of support beyond which further increases in GDP per capita have little effect.
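A quick way to probe the suspected logarithmic shape is to compare a linear and a log fit of support on GDP per capita, reusing the `country_long` data built above (a sketch, not a formal test):

```r
# Hedged check of the apparent log-linear GDP relationship: compare R^2 of a
# linear fit and a log fit. Assumes country_long from the chunk above.
gdp_df <- country_long |> filter(variable == "gdp_pc_ppp")
fit_lin <- lm(trans_support ~ value, data = gdp_df)
fit_log <- lm(trans_support ~ log(value), data = gdp_df)
c(linear_r2 = summary(fit_lin)$r.squared,
  log_r2    = summary(fit_log)$r.squared)
```

If the log fit clearly outperforms the linear one, log-transforming GDP per capita before modelling would be a reasonable choice.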

Boxplots for country-level variables:

ggplot(country_long, aes(y = value, x = variable)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, fill = "lightblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Distribution of country-level variables", x = NULL, y = "value") +
  facet_wrap(~variable, scales = "free") +
  theme(axis.text.y = element_blank(),  
        axis.ticks.x = element_blank())

country_level_data_cor <- country_level_data %>%
  inner_join(trans_docs_prop, by = c("iso2c" = "country"))

cor_matrix <- cor(country_level_data_cor %>%
                    select(gdp_pc_ppp, democracy_index, gender_inequality_index, 
                           lgbt_policy_index, trans_support),
                  use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.col = "black",
         col = colorRampPalette(c("blue", "white", "red"))(200))

Some of these correlations are heavy, which is a little worrying, but since we will mainly use these variables to build cross-level interactions, this should not compromise the reliability of our analysis too much.
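As a hedged sanity check on those correlations, we can inspect variance inflation factors from a simple linear model of country-level support on the four predictors, reusing `country_level_data_cor` from above (performance is already loaded):

```r
# VIFs for the four country-level predictors; values above ~5 would signal
# problematic multicollinearity if these entered one model jointly.
vif_fit <- lm(trans_support ~ gdp_pc_ppp + democracy_index +
                gender_inequality_index + lgbt_policy_index,
              data = country_level_data_cor)
performance::check_collinearity(vif_fit)
```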

Paradata

# Convert labelled variables into factors
paradata <- paradata %>%
  mutate(across(where(~ !is.null(attr(., "labels"))), labelled::to_factor))

# Merging datasets for the analysis
merged_paradata <- merge(data, paradata, by = "serialid", all.x = TRUE)

Number of persons present during interview

library(gmodels)

# Convert NA into an explicit category
merged_paradata$trans_docs <- as.character(merged_paradata$trans_docs)   
merged_paradata$trans_docs[is.na(merged_paradata$trans_docs)] <- "DK"  
merged_paradata$trans_docs <- as.factor(merged_paradata$trans_docs)  

# CrossTable
CrossTable(merged_paradata$p4, merged_paradata$trans_docs,
           digits=2, 
           expected=F, 
           asresid=T, 
           chisq=TRUE, 
           prop.chisq=F, 
           format="SPSS")
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |           Adj Std Resid |
## |-------------------------|
## 
## Total Observations in Table:  27438 
## 
##                                  | merged_paradata$trans_docs 
##               merged_paradata$p4 |       DK  |       No  |      Yes  | Row Total | 
## ---------------------------------|-----------|-----------|-----------|-----------|
## Two (interviewer and respondent) |     2806  |     8146  |    12428  |    23380  | 
##                                  |    12.00% |    34.84% |    53.16% |    85.21% | 
##                                  |    85.55% |    84.02% |    85.93% |           | 
##                                  |    10.23% |    29.69% |    45.29% |           | 
##                                  |     0.58  |    -4.10  |     3.54  |           | 
## ---------------------------------|-----------|-----------|-----------|-----------|
##                            Three |      428  |     1316  |     1758  |     3502  | 
##                                  |    12.22% |    37.58% |    50.20% |    12.76% | 
##                                  |    13.05% |    13.57% |    12.16% |           | 
##                                  |     1.56% |     4.80% |     6.41% |           | 
##                                  |     0.52  |     2.97  |    -3.19  |           | 
## ---------------------------------|-----------|-----------|-----------|-----------|
##                             Four |       36  |      189  |      218  |      443  | 
##                                  |     8.13% |    42.66% |    49.21% |     1.61% | 
##                                  |     1.10% |     1.95% |     1.51% |           | 
##                                  |     0.13% |     0.69% |     0.79% |           | 
##                                  |    -2.50  |     3.25  |    -1.49  |           | 
## ---------------------------------|-----------|-----------|-----------|-----------|
##                     Five or more |       10  |       44  |       59  |      113  | 
##                                  |     8.85% |    38.94% |    52.21% |     0.41% | 
##                                  |     0.30% |     0.45% |     0.41% |           | 
##                                  |     0.04% |     0.16% |     0.22% |           | 
##                                  |    -1.02  |     0.80  |    -0.11  |           | 
## ---------------------------------|-----------|-----------|-----------|-----------|
##                     Column Total |     3280  |     9695  |    14463  |    27438  | 
##                                  |    11.95% |    35.33% |    52.71% |           | 
## ---------------------------------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  26.44717     d.f. =  6     p =  0.0001837381 
## 
## 
##  
##        Minimum expected frequency: 13.50827

The table shows a statistically significant association between the number of people present during the interview and support for transgender people's right to change their documents. Substantively, though, support barely varies: the "Yes" percentages are all within about four points of each other.
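With n = 27,438, even trivial differences reach significance, so a hedged follow-up is to quantify the effect size with Cramér's V (DescTools is already loaded):

```r
# Effect size for the association above; from the printed chi-square,
# V = sqrt(26.447 / (27438 * 2)) is roughly 0.02, i.e. negligible.
CramerV(table(merged_paradata$p4, merged_paradata$trans_docs))
```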

Duration of interview

For the duration of the interview we have two variables, p3 (near-exact minutes) and p3r (recoded into bands). We use p3r for a simpler analysis.
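To see why p3r is simpler, compare the number of factor levels in the two variables:

```r
# p3 records minute-level durations (246 levels); p3r collapses them to 7 bands.
c(p3 = nlevels(merged_paradata$p3), p3r = nlevels(merged_paradata$p3r))
```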

# CrossTable
CrossTable(merged_paradata$p3r, merged_paradata$trans_docs,
           digits=2, 
           expected=F, 
           asresid=T, 
           chisq=TRUE, 
           prop.chisq=F, 
           format="SPSS")
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |           Adj Std Resid |
## |-------------------------|
## 
## Total Observations in Table:  27438 
## 
##                     | merged_paradata$trans_docs 
## merged_paradata$p3r |       DK  |       No  |      Yes  | Row Total | 
## --------------------|-----------|-----------|-----------|-----------|
##    Up to 14 minutes |       73  |      224  |      260  |      557  | 
##                     |    13.11% |    40.22% |    46.68% |     2.03% | 
##                     |     2.23% |     2.31% |     1.80% |           | 
##                     |     0.27% |     0.82% |     0.95% |           | 
##                     |     0.85  |     2.43  |    -2.88  |           | 
## --------------------|-----------|-----------|-----------|-----------|
##     15 - 29 minutes |      782  |     2284  |     3456  |     6522  | 
##                     |    11.99% |    35.02% |    52.99% |    23.77% | 
##                     |    23.84% |    23.56% |    23.90% |           | 
##                     |     2.85% |     8.32% |    12.60% |           | 
##                     |     0.10  |    -0.61  |     0.52  |           | 
## --------------------|-----------|-----------|-----------|-----------|
##     30 - 44 minutes |     1429  |     4388  |     6305  |    12122  | 
##                     |    11.79% |    36.20% |    52.01% |    44.18% | 
##                     |    43.57% |    45.26% |    43.59% |           | 
##                     |     5.21% |    15.99% |    22.98% |           | 
##                     |    -0.75  |     2.66  |    -2.06  |           | 
## --------------------|-----------|-----------|-----------|-----------|
##     45 - 59 minutes |      703  |     1993  |     3039  |     5735  | 
##                     |    12.26% |    34.75% |    52.99% |    20.90% | 
##                     |    21.43% |    20.56% |    21.01% |           | 
##                     |     2.56% |     7.26% |    11.08% |           | 
##                     |     0.80  |    -1.04  |     0.48  |           | 
## --------------------|-----------|-----------|-----------|-----------|
##     60 - 74 minutes |      202  |      529  |      908  |     1639  | 
##                     |    12.32% |    32.28% |    55.40% |     5.97% | 
##                     |     6.16% |     5.46% |     6.28% |           | 
##                     |     0.74% |     1.93% |     3.31% |           | 
##                     |     0.48  |    -2.67  |     2.25  |           | 
## --------------------|-----------|-----------|-----------|-----------|
##     75 - 89 minutes |       49  |      123  |      276  |      448  | 
##                     |    10.94% |    27.46% |    61.61% |     1.63% | 
##                     |     1.49% |     1.27% |     1.91% |           | 
##                     |     0.18% |     0.45% |     1.01% |           | 
##                     |    -0.67  |    -3.52  |     3.80  |           | 
## --------------------|-----------|-----------|-----------|-----------|
## 90 minutes and more |       42  |      154  |      219  |      415  | 
##                     |    10.12% |    37.11% |    52.77% |     1.51% | 
##                     |     1.28% |     1.59% |     1.51% |           | 
##                     |     0.15% |     0.56% |     0.80% |           | 
##                     |    -1.16  |     0.76  |     0.02  |           | 
## --------------------|-----------|-----------|-----------|-----------|
##        Column Total |     3280  |     9695  |    14463  |    27438  | 
##                     |    11.95% |    35.33% |    52.71% |           | 
## --------------------|-----------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  36.84302     d.f. =  12     p =  0.0002368813 
## 
## 
##  
##        Minimum expected frequency: 49.61003

It appears that the longer the interview, the more likely respondents are to support the right, with support peaking in 75-89-minute interviews, while the highest opposition and the highest DK rates occur in interviews under 15 minutes.

Respondent cooperation

data_summary <- merged_paradata |> 
  mutate(trans_docs = fct_recode(trans_docs, "Don't Know" = "DK"))|>
  count(p5, trans_docs, name = "n") |> 
  group_by(p5) |>  
  mutate(percentage = n / sum(n))  

ggplot(data_summary, aes(x = p5, y = percentage, fill = trans_docs)) +
  geom_bar(stat = "identity", position = "fill") +
  geom_text(aes(label = scales::percent(percentage, accuracy = 1)),  
            position = position_stack(vjust = 0.5), size = 4) + 
  scale_y_continuous(labels = percent_format()) + 
  scale_fill_manual(
    values = c("Yes" = "#2ECC71", "No" = "#E74C3C", "Don't Know" = "#3498DB")
  ) +
  labs(
    title = "Support for the right of trans people to change their gender in civil documents",
    x = "Respondent cooperation in the interview",
    y = "Proportion of Responses",
    fill = "Support the right"
  ) +
  coord_flip() +  
  theme_minimal()+
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank()
  )

Respondents who cooperated better in the interview were more likely to express support; this is the clearest paradata relationship.

Low cooperation was associated with more DK responses and more opposition to the right to change documents.

Correlation and multicollinearity

Now we want to examine possible multicollinearity among our explanatory variables. To do so, we calculate the correlations among our numerical variables. Since there are many variables, we visualize only the strongly correlated pairs to focus on the key dependencies.

cor_matrix <- cor(data %>%
                    select_if(is.numeric) %>%
                    select(-serialid),
                  use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black",
         col = colorRampPalette(c("blue", "white", "red"))(200))

# Create a matrix where only correlations >= 0.7 are maintained
cor_filtered <- ifelse(abs(cor_matrix) >= 0.7, cor_matrix, NA)

# Graph
corrplot(cor_filtered, method = "color", type = "upper", 
         tl.col = "black", 
         col = colorRampPalette(c("blue", "white", "red"))(200),
         na.label = " ",
         insig = "blank") 

The strongest correlations are among variables related to racial, religious, or sexual-orientation discrimination. Consequently, we group them into three new composite variables by averaging the related items, in order to reduce multicollinearity.

# Creating grouped variables
data <- data %>%
  mutate(
    racial_discri = rowMeans(select(., roma_discri, black_discri, asian_discri), na.rm = TRUE), #no white
    sexual_discri = rowMeans(select(., lgb_discri, trans_discri, intersex_discri), na.rm = TRUE),
    religious_discri = rowMeans(select(., jewish_discri, muslim_discri, buddihst_discri), na.rm = TRUE)
  ) 

# Removing original variables to avoid redundancy
data <- data |> 
  select(-c(roma_discri, black_discri, asian_discri, 
            lgb_discri, trans_discri, intersex_discri, 
            jewish_discri, muslim_discri, buddihst_discri))

# Graphical representation
cor_matrix <- cor(data %>%
                    select_if(is.numeric) %>%
                    select(-serialid),
                  use = "complete.obs")


corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black",
         col = colorRampPalette(c("blue", "white", "red"))(200))

We see that our new composite variables are still highly correlated with one another (bottom-right corner). To reduce the dimensionality of the dataset further while preserving information, we create a single variable for general minority discrimination.

# Creating grouped variable
data <- data %>%
  mutate(
    minority_discri = rowMeans(select(., racial_discri, sexual_discri,
                                      religious_discri, disability_discri), na.rm = TRUE)) 

We exclude white_discri, atheist_discri and christian_discri because they have a median of 1 and a small mean, meaning there is almost no reported discrimination against these groups in the dataset.

summary(data$white_discri)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.785   2.000  10.000     403
summary(data$atheist_discri)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.500   2.802   4.000  10.000     517
summary(data$christian_discri)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   1.000   1.000   1.904   2.000  10.000     414
data <- data |> 
  select(-c(white_discri, atheist_discri, christian_discri, racial_discri, 
            sexual_discri, religious_discri, disability_discri))

Final check of the correlation matrix to confirm that multicollinearity has been reduced.

cor_matrix <- cor(data %>%
                    select_if(is.numeric) %>%
                    select(-serialid),
                  use = "complete.obs")


corrplot(cor_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.col = "black",
         col = colorRampPalette(c("blue", "white", "red"))(200))

Class imbalance

prop.table(table(data$trans_docs, useNA = "ifany"))
## 
##       Yes        No      <NA> 
## 0.5271157 0.3533421 0.1195422
prop.table(table(data$trans_docs, useNA = "no"))
## 
##       Yes        No 
## 0.5986837 0.4013163

Our target variable has a 60/40 distribution so we are not worried about class imbalance.

Less-represented classes in other variables have already been aggregated where necessary.

Analysis of missing values

Here we analyse the NA responses to our target variable. This is a robustness check to understand our data and to learn whether there are patterns in non-response we should worry about, e.g. specific demographics that are less likely to respond.

Overall, the analysis below shows that non-response follows the descriptive patterns: the groups with higher non-response overlap with those that tended to answer "No" to qc19. We are therefore more at risk of underreporting "No" answers due to non-response. The associations are weak, though, so this is not a major concern.

Who is more likely to not respond to our target?

There are 3,280 NA responses to the target variable. To assess whether these values are missing at random, we test the association between NA response (treated as DK, "don't know") and the other variables.

In the descriptive analysis above, we have already seen in raw terms that NA responses were more frequent from some groups e.g.

  • women,

  • older people,

  • people who also responded NA for political ideology,

  • People who do not have trans friends and those who refused to respond to the friendship question had higher NA responses to the target variable.

  • People with lower survey cooperation

  • By country, we already know from the challenge description that NA rates vary considerably.

Create a binary variable for the DK

cntry_name <- codelist |> select(iso2c, country_name = country.name.en)
dk_target <- data |> 
  mutate(target_NA = ifelse(is.na(trans_docs), 1, 0)) |> 
  left_join(cntry_name, by = join_by(country == iso2c)) |> 
  dplyr::select(-trans_docs, -serialid, -country)

table(dk_target$target_NA)
## 
##     0     1 
## 24158  3280

Confirm DK rates by country: they vary from 1.4% in Belgium to 28.5% in Bulgaria, with an overall average of 11.95%.
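A hedged numeric check of that range, before plotting:

```r
# Per-country NA rates; the min and max should match Belgium (~1.4%) and
# Bulgaria (~28.5%) quoted above.
dk_target |>
  group_by(country_name) |>
  summarise(pct_na = mean(target_NA) * 100) |>
  summarise(min_pct = min(pct_na), max_pct = max(pct_na))
```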

dk_target |> 
  group_by(country_name) |> 
  summarise(count_na = sum(target_NA),
            num_resp = length(country_name),
            pct_na = count_na/num_resp*100) |> 
  ggplot(aes(x=reorder(country_name, -pct_na), y = pct_na))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank()) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + 
  labs(y = "Proportion of NA responses")

If all NA responses to the target are removed, the "Yes" share increases by a different amount in each country. The removal is not proportional, so dropping NAs makes the cross-country differences in support appear larger.

data |> 
  group_by(country) |> 
  summarise(yes_count = sum(trans_docs == "Yes", na.rm=TRUE),
            num_resp = sum(!is.na(trans_docs)),
            num_na = sum(is.na(trans_docs)),
            pct_yes_withna = yes_count / length(country)*100,
            pct_yes = yes_count/num_resp*100)|> 
  ggplot(aes(x = reorder(country, -pct_yes))) +
  geom_col(aes(y = pct_yes), fill = "red") + # First bar (pct_yes)
  geom_col(aes(y = pct_yes_withna), fill = "yellow", alpha = 0.5) + # Second bar (pct_yes_withna)
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.title.x = element_blank()) +
  scale_y_continuous(labels = scales::percent_format(scale = 1),
                     limits = c(0,100)) + 
  labs(title ="Support for transgender rights to legally change documents, with and without NA's")

Check correlation with DK of target

We model which variables are correlated with DK response, using Cramér's V for associations with the factor variables and Pearson correlations for the numeric variables in the cleaned data.

  1. Model the data for DK responses against the factor variables.
factor_subset <- dk_target %>%
  dplyr::select(target_NA, where(is.factor)) |> 
  mutate(target_NA = factor(target_NA))

# Run cramers V for factors

target_var <- "target_NA"
factor_variables <- names(factor_subset)
factor_variables <- factor_variables[factor_variables != target_var]

cramers_v_results <- list()

for (variable in factor_variables) {
    contingency_table <- table(factor_subset[[target_var]], 
                               factor_subset[[variable]])
    cramers_v <- CramerV(contingency_table)
    cramers_v_results[[variable]] <- cramers_v
}

# Create a Tibble for Results
cat_results <- tibble(variable = names(cramers_v_results),
       cramers_v = unlist(cramers_v_results)) |> 
  arrange(desc(cramers_v))
cat_results
## # A tibble: 22 × 2
##    variable       cramers_v
##    <chr>              <dbl>
##  1 internet_use      0.136 
##  2 social_class      0.0975
##  3 occupation        0.0852
##  4 marital_status    0.0831
##  5 gender_docs       0.0820
##  6 religion          0.0763
##  7 phone_access      0.0711
##  8 life_sat          0.0642
##  9 polintr           0.0621
## 10 friends_trans     0.0581
## # ℹ 12 more rows

This shows that all of the factors have quite a low association with NA responses, the highest being internet use. This is consistent with the missing values being close to random.
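As a hedged follow-up on the strongest associate, a chi-square test for internet_use (significance is expected at this sample size; the point is how small the effect is):

```r
# Association between NA response and internet use; likely significant but weak
# (Cramér's V ~ 0.14 as computed above).
chisq.test(table(factor_subset$target_NA, factor_subset$internet_use))
```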

  2. Look at the numeric variables and test correlations:
numeric_subset <- dk_target %>%
  dplyr::select(target_NA, where(is.numeric)) 

target_var <- "target_NA"
numeric_variables <- names(numeric_subset)[names(numeric_subset) != target_var]

cor_results <- list()

for (variable in numeric_variables) {
    correlation <- cor(numeric_subset$target_NA, 
                       numeric_subset[[variable]],
                       use = "pairwise.complete.obs")
    cor_results[[variable]] <- correlation
}

# Create a Tibble for Results
cor_results <- tibble(variable = names(cor_results),
       cor = unlist(cor_results)) |> 
  arrange(cor)
cor_results
## # A tibble: 7 × 2
##   variable                     cor
##   <chr>                      <dbl>
## 1 n_actions_against_discri -0.0988
## 2 n_friends_minorities     -0.0965
## 3 years_edu                -0.0544
## 4 social_alienation         0.0566
## 5 minority_discri           0.0635
## 6 age                       0.0982
## 7 antilgbtq_rights          0.108

Again, this is not much cause for concern: the strongest correlations with the target variable, positive or negative, are only around ±0.1.

Modelling relationship with key variables and non-response

Here we run a logistic regression to test for the most important/significant variables. First, we make a subset of the data with only the variables that were most associated in the section above (roughly |correlation| ≥ 0.095 and Cramér's V ≥ 0.1), also including gender and country.

cor_results |> filter(abs(cor) >= 0.095) |> pull(variable)
## [1] "n_actions_against_discri" "n_friends_minorities"    
## [3] "age"                      "antilgbtq_rights"
cat_results |> filter(cramers_v >0.1) |> pull(variable)
## [1] "internet_use"

Now we model with the most strongly associated variables to see if they significantly explain the DK values. Gender is also included as a key demographic.

We run a stepwise AIC on these most correlated variables and then use the best model to test the overall model fit.

# first, run stepwise to determine most important 
# run base model 
dk_fit_null <- glm(target_NA ~ 1,
              data = dk_target, 
              family = "binomial")
# run full model of the most correlated vars plus country (with age^2 as a quadratic term)
dk_fit <- glm(target_NA ~ gender + age + I(age*age) + antilgbtq_rights + internet_use + social_class + country_name, 
              data = dk_target, 
              family = "binomial")

# use stepwise to select the best model
aic_1 <- MASS::stepAIC(dk_fit, scope = list(upper = dk_fit, 
                             lower = dk_fit_null),
        direction = "both", k = 2, trace=0) # forward based on AIC

# filter vars significant at 5% and calculate exponential
broom::tidy(aic_1) |> 
  filter(p.value<0.05) |> 
  mutate(oddsratio = exp(estimate)) |> 
  select(term, oddsratio, p.value)
## # A tibble: 29 × 3
##    term                                          oddsratio  p.value
##    <chr>                                             <dbl>    <dbl>
##  1 (Intercept)                                      0.0500 8.78e-40
##  2 genderWoman                                      1.20   2.01e- 5
##  3 I(age * age)                                     1.00   4.77e- 4
##  4 antilgbtq_rights                                 1.13   9.69e- 7
##  5 internet_useTwo or three times a week            1.21   2.72e- 2
##  6 internet_useTwo or three times a month           0.307  4.97e- 3
##  7 internet_useNever/No access                      1.28   2.49e- 4
##  8 internet_useNo Internet access at all            1.57   4.62e- 4
##  9 social_classThe lower middle class of society    0.766  4.62e- 5
## 10 social_classThe middle class of society          0.713  4.09e-11
## # ℹ 19 more rows

The best model drops the linear age term but keeps age^2; we nevertheless report the full model so that the lower-order term stays included.

The model has some interesting findings about who did not respond. Looking at the odds ratios (excluding the country terms, which are covered above):

  • women are around 20% more likely to give an NA response

  • each additional point on the anti-LGBTI rights scale raises the odds of non-response by about 13%

  • internet use was significant: those who never use the internet or have no access at all were less likely to respond than daily users

  • social class was significant at all levels: the working class (the reference category) was more likely to give an NA response than every other class

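To make the odds-ratio interpretation concrete, a worked example of what an odds ratio of 1.20 for women implies; the 0.11 baseline probability is illustrative, not estimated from the model:

```r
# Convert the OR for women into probabilities at an assumed baseline.
p_men      <- 0.11                 # hypothetical NA probability for men
odds_men   <- p_men / (1 - p_men)  # baseline odds
odds_women <- odds_men * 1.20      # apply the odds ratio
p_women    <- odds_women / (1 + odds_women)
p_women                            # ~0.13: roughly a 2-point increase
```

Note that "20% more likely" refers to odds, not probabilities; the implied probability increase is smaller.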

For comparison, we also fit the full logistic model with all available covariates:

na_model <- glm(target_NA ~ .,
                      data = dk_target, 
                      family = "binomial")

broom::tidy(na_model) |> arrange(desc(estimate))
## # A tibble: 89 × 5
##    term                       estimate std.error statistic       p.value
##    <chr>                         <dbl>     <dbl>     <dbl>         <dbl>
##  1 country_nameBulgaria          1.26     0.346       3.63 0.000280     
##  2 country_nameSweden            0.822    0.320       2.56 0.0103       
##  3 religionMuslim                0.692    0.325       2.13 0.0335       
##  4 country_nameLatvia            0.687    0.321       2.14 0.0323       
##  5 country_nameCyprus            0.619    0.450       1.38 0.169        
##  6 country_nameGermany           0.552    0.303       1.82 0.0683       
##  7 country_nameUnited Kingdom    0.545    0.329       1.66 0.0977       
##  8 gender_docsNo                 0.536    0.0930      5.77 0.00000000796
##  9 country_nameFinland           0.507    0.328       1.54 0.123        
## 10 country_namePoland            0.501    0.326       1.54 0.124        
## # ℹ 79 more rows

Alternative approach: test the relationship to DK using a train/test split and undersampling (to reduce the class imbalance between NA and non-NA on the target variable):

# create new dataset with just vars we want, including age^2
dk_target2 <- dk_target |> 
  select(target_NA, age, antilgbtq_rights, internet_use, social_class, country_name) |> 
  mutate(age_sq = age*age, .after = age)

# to deal with some class imbalance, undersample from the majority group (NA = 0)
set.seed(123)
# Generate an index
index <- createDataPartition(dk_target2$target_NA, p = 0.7, list = FALSE, times = 1)

# Subset the dataframe
train <- dk_target2[index, ]
test <- dk_target2[-index, ]
# check splits 
prop.table(table(train$target_NA))
## 
##         0         1 
## 0.8787421 0.1212579
prop.table(table(test$target_NA))
## 
##         0         1 
## 0.8844612 0.1155388

Now we undersample the majority class:

set.seed(123)
under <- ovun.sample(target_NA~., 
                     data=train, 
                     method = "under", 
                     N = 4000)$data 
    # limit the sample to 4000 obs as we have ~2300 for target minority class. 

table(under$target_NA)
## 
##    0    1 
## 2112 1888
# Run a random forest to see how much of the NA variation we can explain
# (target_NA is numeric here, so randomForest fits a regression forest)
rfunder <- randomForest(target_NA ~ ., data = under)
rfunder
## 
## Call:
##  randomForest(formula = target_NA ~ ., data = under) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.239057
##                     % Var explained: 4.08
# check a simple logistic model's outcomes on the undersampled data
broom::tidy(glm(target_NA ~., 
                data = under, 
                family="binomial")) |> 
  filter(p.value<0.05) |> 
  mutate(oddsratio = exp(estimate)) |> 
  select(term, oddsratio, p.value)
## # A tibble: 15 × 3
##    term                                          oddsratio    p.value
##    <chr>                                             <dbl>      <dbl>
##  1 (Intercept)                                       0.451 0.0178    
##  2 antilgbtq_rights                                  1.13  0.00254   
##  3 internet_useTwo or three times a week             1.39  0.0173    
##  4 internet_useNever/No access                       1.55  0.000114  
##  5 internet_useNo Internet access at all             1.85  0.0120    
##  6 social_classThe middle class of society           0.681 0.00000455
##  7 social_classThe upper middle class of society     0.705 0.0297    
##  8 country_nameBelgium                               0.141 0.00000805
##  9 country_nameBulgaria                              2.62  0.0000802 
## 10 country_nameCyprus                                2.72  0.000779  
## 11 country_nameEstonia                               1.65  0.0454    
## 12 country_nameLatvia                                1.88  0.0125    
## 13 country_nameLithuania                             2.06  0.00589   
## 14 country_nameNetherlands                           0.423 0.00793   
## 15 country_namePoland                                2.23  0.00143

We see that the random forest explains only around 4% of the variance. The simple logistic model on the undersampled data finds that DK responses were associated with opposition to LGBTI rights, less frequent internet use, and being working class. This is all useful to compare against the overall response patterns.
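The 30% test split created above was never scored; a hedged sketch evaluating the undersampled logistic model out of sample with an ROC AUC (pROC is already loaded):

```r
# Fit on the undersampled training data, predict on the untouched test set.
under_glm <- glm(target_NA ~ ., data = under, family = "binomial")
test_pred <- predict(under_glm, newdata = test, type = "response")
roc_test  <- roc(test$target_NA, test_pred)
auc(roc_test)  # an AUC near 0.5 would mean NA status is nearly unpredictable
```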

Testing the NA values against paradata

Here we extend the analysis to see whether the paradata are related to non-response.

para <- paradata |> 
  bind_cols(trans_docs = data$trans_docs,
            age = data$age,
            gender = data$gender,
            country = data$country) |> 
  mutate(target_NA = ifelse(is.na(trans_docs), 1, 0)) |> 
  select(-serialid, -trans_docs)

summary(para$target_NA)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1195  0.0000  1.0000
str(para)
## tibble [27,438 × 9] (S3: tbl_df/tbl/data.frame)
##  $ p2       : Factor w/ 6 levels "Before 8 h","8 - 12 h",..: 4 4 3 4 2 4 5 2 4 4 ...
##   ..- attr(*, "label")= chr "TIME OF INTERVIEW"
##  $ p3       : Factor w/ 246 levels "2 minutes","3",..: 44 39 37 30 30 25 26 33 39 39 ...
##   ..- attr(*, "label")= chr "DURATION OF INTERVIEW"
##  $ p3r      : Factor w/ 7 levels "Up to 14 minutes",..: 4 3 3 3 3 2 2 3 3 3 ...
##   ..- attr(*, "label")= chr "DURATION OF INTERVIEW (RECODED)"
##  $ p4       : Factor w/ 5 levels "Two (interviewer and respondent)",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "N OF PERSONS PRESENT DURING INTERVIEW"
##  $ p5       : Factor w/ 5 levels "Excellent","Fair",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "RESPONDENT COOPERATION"
##  $ age      : num [1:27438] 51 62 38 29 63 41 48 88 44 45 ...
##  $ gender   : Factor w/ 2 levels "Man","Woman": 1 2 2 1 2 2 1 1 2 2 ...
##  $ country  : chr [1:27438] "BE" "BE" "BE" "BE" ...
##  $ target_NA: num [1:27438] 0 0 0 0 0 0 0 0 0 0 ...

First, we again test correlation. We use Cramér's V since we only have factors:
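As a reminder of what the statistic measures, Cramér's V is the square root of chi-squared over n times (k − 1), where k is the smaller table dimension. A minimal sketch on a toy table (the counts in `tab` are invented) cross-checked against `DescTools::CramerV`:

```r
library(DescTools)

tab  <- matrix(c(30, 10, 20, 40), nrow = 2)   # invented counts
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
n    <- sum(tab)

sqrt(chi2 / (n * (min(dim(tab)) - 1)))        # manual Cramér's V
CramerV(tab)                                  # should agree with the manual value
```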

target_var <- "target_NA"
para_vars <- names(para)[names(para) != target_var]

cramers_v_results <- list()

for (variable in para_vars) {
    contingency_table <- table(para[[target_var]], 
                               para[[variable]])
    contingency_table_no_dk <- contingency_table[, colnames(contingency_table) != "DK"]
    cramers_v <- CramerV(contingency_table_no_dk) 
    # the DK column is all zeros, which creates errors, so we exclude it
    cramers_v_results[[variable]] <- cramers_v
}


# Create a Tibble for Results
tibble(variable = names(cramers_v_results),
       cramers_v = unlist(cramers_v_results)) |> 
  arrange(desc(cramers_v))
## # A tibble: 8 × 2
##   variable cramers_v
##   <chr>        <dbl>
## 1 country     0.168 
## 2 age         0.123 
## 3 p5          0.106 
## 4 p3          0.0916
## 5 gender      0.0292
## 6 p4          0.0165
## 7 p2          0.0116
## 8 p3r         0.0113

Again, Cramér's V shows only low correlation between the paradata and non-response.

Second, we run another logistic model with just Paradata included.

na_para_model <- glm(target_NA ~ p2 + p3r + p4 + p5 + age + I(age*age) + gender + factor(country), 
                     data = para, 
                     family=binomial(link = "logit"))

broom::tidy(na_para_model) |> 
  filter(p.value<0.05) |> 
  mutate(oddsratio = exp(estimate)) |> 
  select(term, oddsratio, p.value)
## # A tibble: 25 × 3
##    term              oddsratio  p.value
##    <chr>                 <dbl>    <dbl>
##  1 (Intercept)          0.0439 4.65e-12
##  2 p4Four               0.694  4.20e- 2
##  3 p5Fair               1.42   2.05e-15
##  4 p5Average            1.74   8.91e-18
##  5 p5Bad                3.49   2.14e-22
##  6 age                  0.983  2.85e- 3
##  7 I(age * age)         1.00   3.43e- 9
##  8 genderWoman          1.14   1.07e- 3
##  9 factor(country)BE    0.179  6.77e- 9
## 10 factor(country)BG    5.30   2.16e-31
## # ℹ 15 more rows

We see that when controlling for age, gender and country, the survey cooperation variable is significant.

Compared to respondents with an excellent cooperation rating, those rated lower were more likely to give an NA response, and this likelihood increases as cooperation worsens. If cooperation was fair, the odds of an NA were around 40% higher; if average, around 75% higher; and if bad, about 3.5 times as high (roughly 250% higher). This variable also carries a risk of reverse causality, as a high rate of NA responses is itself a likely reason to be rated as uncooperative.
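These percent changes follow directly from the odds ratios in the model output, via (OR − 1) × 100:

```r
# Converting the cooperation odds ratios reported above into percent changes in odds
or <- c(fair = 1.42, average = 1.74, bad = 3.49)
round((or - 1) * 100)  # fair ~42%, average ~74%, bad ~249% higher odds of an NA
```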

Overall, this section found that the profile of people more likely to respond NA resembles that of people more likely to oppose trans rights to change papers. So there is some risk of under-reporting of true support if those opposed are not responding. This likely captures some social desirability bias: people hide their true views because they do not want to appear discriminatory.

Imputing missing data

As we have different types of variables, we go with mice's default methods; selecting the best method for each variable individually would be too computationally intensive.

We know it would be better to impute multiple datasets and then pool the results of regressions run on each of them, but fitting more than one multilevel regression would be too computationally intensive, so we impute a single dataset.
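For reference, the pooled workflow we would run with more compute looks like the following — a sketch on mice's built-in `nhanes` data rather than our survey data:

```r
library(mice)

# m = 5 imputed datasets instead of 1
imp  <- mice(nhanes, m = 5, seed = 1234, printFlag = FALSE)

# fit the same model on each completed dataset
fits <- with(imp, lm(bmi ~ age + hyp))

# pool estimates and standard errors across imputations (Rubin's rules)
summary(pool(fits))
```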

# Excluding target variable and other useless variables
imputation_data <- data |> 
  select(-c(serialid, country, trans_docs))

# Running it with the default methods
# By default: numerical variables -> pmm, binary factors -> logreg, > 2 levels factors -> polyreg
imputed_data <- mice(imputation_data, m = 1, seed=1234)
## 
##  iter imp variable
##   1   1  years_edu  community  marital_status  social_class  religion  bill_issues  life_sat  left_right  social_alienation  gender_docs  friends_trans  antilgbtq_rights  minority_discri
##   2   1  years_edu  community  marital_status  social_class  religion  bill_issues  life_sat  left_right  social_alienation  gender_docs  friends_trans  antilgbtq_rights  minority_discri
##   3   1  years_edu  community  marital_status  social_class  religion  bill_issues  life_sat  left_right  social_alienation  gender_docs  friends_trans  antilgbtq_rights  minority_discri
##   4   1  years_edu  community  marital_status  social_class  religion  bill_issues  life_sat  left_right  social_alienation  gender_docs  friends_trans  antilgbtq_rights  minority_discri
##   5   1  years_edu  community  marital_status  social_class  religion  bill_issues  life_sat  left_right  social_alienation  gender_docs  friends_trans  antilgbtq_rights  minority_discri
imputed_data$method
##                      age                   gender                years_edu 
##                       ""                       ""                    "pmm" 
##                community           marital_status               occupation 
##                "polyreg"                "polyreg"                       "" 
##             social_class                 religion           nonEU_national 
##                "polyreg"                "polyreg"                       "" 
##             phone_access              bill_issues             internet_use 
##                       ""                "polyreg"                       "" 
##                 life_sat                  polintr               left_right 
##                "polyreg"                       ""                "polyreg" 
##        social_alienation          ethnic_minority       skincolor_minority 
##                    "pmm"                       ""                       "" 
##       religious_minority            roma_minority          sexual_minority 
##                       ""                       ""                       "" 
##      disability_minority           suffered_discr              gender_docs 
##                       ""                       ""                 "logreg" 
##            friends_trans     n_friends_minorities n_actions_against_discri 
##                "polyreg"                       ""                       "" 
##         antilgbtq_rights          minority_discri 
##                    "pmm"                    "pmm"
final_data <- complete(imputed_data)

colSums(is.na(imputation_data))
##                      age                   gender                years_edu 
##                        0                        0                      387 
##                community           marital_status               occupation 
##                       14                      175                        0 
##             social_class                 religion           nonEU_national 
##                     1021                      483                        0 
##             phone_access              bill_issues             internet_use 
##                        0                      379                        0 
##                 life_sat                  polintr               left_right 
##                      102                        0                     4689 
##        social_alienation          ethnic_minority       skincolor_minority 
##                     1000                        0                        0 
##       religious_minority            roma_minority          sexual_minority 
##                        0                        0                        0 
##      disability_minority           suffered_discr              gender_docs 
##                        0                        0                     3187 
##            friends_trans     n_friends_minorities n_actions_against_discri 
##                      968                        0                        0 
##         antilgbtq_rights          minority_discri 
##                      791                      355
colSums(is.na(final_data))
##                      age                   gender                years_edu 
##                        0                        0                        0 
##                community           marital_status               occupation 
##                        0                        0                        0 
##             social_class                 religion           nonEU_national 
##                        0                        0                        0 
##             phone_access              bill_issues             internet_use 
##                        0                        0                        0 
##                 life_sat                  polintr               left_right 
##                        0                        0                        0 
##        social_alienation          ethnic_minority       skincolor_minority 
##                        0                        0                        0 
##       religious_minority            roma_minority          sexual_minority 
##                        0                        0                        0 
##      disability_minority           suffered_discr              gender_docs 
##                        0                        0                        0 
##            friends_trans     n_friends_minorities n_actions_against_discri 
##                        0                        0                        0 
##         antilgbtq_rights          minority_discri 
##                        0                        0
final_data <- final_data |> 
  cbind(data$trans_docs) |> 
  cbind(data$country) |>
  rename("trans_docs" = "data$trans_docs",
         "country" = "data$country") |> 
  relocate(country, .before = everything())

Joining together all our data

country_level_data stores all our data defined at the country level.

final_data stores all our data (after imputation) for individual level variables.

complete_df <- final_data |> 
  left_join(country_level_data, 
            by = join_by(country == iso2c)) |> 
  select(-c(country.y, iso3c))

Explanatory model

Before running our final explanatory model we run a logistic regression with all our individual level predictors to see which variables are significant.

We also run a stepwise regression and Lasso regression to select the most important subset of predictors.

Comparing the outputs of these three regressions gives us hints about which are the most important variables and levels to include in our final explanatory model.

Simple logistic regression

We use only individual-level data for now. This is a simple model to see which variables are significant at the global level, i.e. without country-level controls. The numeric data has also been scaled.

datalr <- final_data |> 
  # Dropping NAs (there should be NAs only in our target variable)
  drop_na() |> 
  # Selecting only predictor variables
  select(-"country") |> 
  # Scale variables 
  mutate(across(where(is.numeric), scale)) |> 
  # Transforming target to binary numeric Yes=1, No=0
  mutate(trans_docs = as.numeric(trans_docs == "Yes"))
  
# now run the full model
simple_logistic <- glm(trans_docs ~ ., data = datalr, family = binomial(link = "logit"))

# Save the summary
logistic_summary <- summary(simple_logistic)

# create a tibble to more easily search through significant variables
logistic_results <- tibble(
  broom::tidy(simple_logistic) |>
    filter(p.value<0.05) |>
    mutate(oddsratio = exp(estimate)) |>
    select(term, estimate, oddsratio, p.value) |> 
    arrange(oddsratio))
  
print(logistic_results, n = Inf)
## # A tibble: 26 × 4
##    term                                             estimate oddsratio   p.value
##    <chr>                                               <dbl>     <dbl>     <dbl>
##  1 gender_docsNo                                      -2.66     0.0697 0        
##  2 friends_transRefusal (SPONTANEOUS)                 -0.798    0.450  2.05e-  5
##  3 antilgbtq_rights                                   -0.699    0.497  1.16e-194
##  4 roma_minority1                                     -0.568    0.567  1.43e-  4
##  5 internet_useTwo or three times a week              -0.418    0.658  4.29e-  8
##  6 life_satNot at all satisfied                       -0.395    0.673  3.56e-  4
##  7 internet_useNever/No access                        -0.348    0.706  1.92e-  7
##  8 bill_issuesFrom time to time                       -0.335    0.716  8.36e-  6
##  9 internet_useAbout once a week                      -0.332    0.718  2.13e-  2
## 10 occupationRetired (4 in d15a)                      -0.321    0.725  1.24e-  4
## 11 occupationStudents (2 in d15a)                     -0.294    0.745  9.72e-  3
## 12 occupationOther white collars (13 or 14 in d15a)   -0.263    0.769  1.61e-  3
## 13 minority_discri                                    -0.233    0.792  2.00e- 24
## 14 ethnic_minority1                                   -0.231    0.794  3.62e-  2
## 15 suffered_discr1                                    -0.221    0.802  5.75e-  5
## 16 bill_issuesAlmost never/never                      -0.211    0.810  4.20e-  3
## 17 left_right(7 -10) Right                            -0.166    0.847  6.88e-  4
## 18 occupationManual workers (15 to 18 in d15a)        -0.161    0.851  4.00e-  2
## 19 life_satNot very satisfied                         -0.154    0.858  2.26e-  2
## 20 religionNon-believers                               0.136    1.15   8.95e-  3
## 21 marital_statusSingle (9-10 in d7)                   0.138    1.15   2.42e-  2
## 22 n_friends_minorities                                0.161    1.17   6.76e- 14
## 23 age                                                 0.209    1.23   4.77e- 10
## 24 genderWoman                                         0.263    1.30   5.44e- 12
## 25 phone_accessLandline & mobile                       0.275    1.32   2.97e- 12
## 26 (Intercept)                                         2.69    14.8    4.64e- 77
Interpreting results

Interpreting the coefficients as odds ratios, any value above 1 indicates a greater likelihood of supporting the right of trans people to change civil documents.

The logistic regression shows 26 statistically significant terms, including factor variables that contribute multiple terms. Of our 29 scaled variables, those significant at the 5% level are:

More likely to OPPOSE the right to change civil documents for trans people:

  • men (compared to women)

  • younger people (this is in a way unexpected)

  • people who oppose legal rights to add a third-gender in official docs (compared to those who do)

  • people who refused to answer whether they have trans friends (compared to those who do)

  • people who are more financially sound (have fewer bill issues)

  • people not satisfied with their life (compared to those who are ‘very satisfied’)

  • people who identified as roma or an ethnic minority

  • people who are more discriminatory against minorities


More likely to SUPPORT the right:

  • older people

  • women

  • people who were self-employed (compared to all other occupation types incl students)

  • unmarried (single) people

  • non-believers (compared to Catholics)

  • people with a landline and mobile

  • people who use the internet everyday/almost everyday (compared to all other internet use categories)

  • people who reported being more left wing (compared to right wing)

  • people who had friends in minority groups.


NUMERIC VARIABLES

Without scaling, we interpret logistic regression output as:

coef(simple_logistic): these coefficients represent the change in the log-odds for a one-unit increase in the corresponding independent variable

exp(coef(simple_logistic)): the exponential of the slope coefficient (exp(B)) tells us the change of the odds if the independent variable increases by one unit

If the data is scaled:

A coefficient B (from a logistic regression) now reflects the change in the log-odds for a one standard deviation increase in the predictor variable, rather than a one-unit increase in the original scale.

The odds ratio (exp(B)) now tells you how the odds change with a one standard deviation increase in the predictor variable, rather than a one-unit increase.

Here we compare the pre and post transformation of coefficients for interpretation:

# all the coefficients
head(coef(simple_logistic))
##                         (Intercept)                                 age 
##                          2.69303178                          0.20879798 
##                         genderWoman                           years_edu 
##                          0.26342657                         -0.01215087 
## communitySmall or middle sized town                 communityLarge town 
##                          0.01810837                         -0.04039684
# for their interpretation
head(exp(coef(simple_logistic)))
##                         (Intercept)                                 age 
##                          14.7764069                           1.2321960 
##                         genderWoman                           years_edu 
##                           1.3013817                           0.9879227 
## communitySmall or middle sized town                 communityLarge town 
##                           1.0182733                           0.9604082
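If needed, a scaled coefficient can be mapped back to the original units by dividing by the predictor's standard deviation, since standardizing multiplies the slope by sd(x). A sketch on simulated data (not our survey):

```r
set.seed(42)
x <- rnorm(1000, mean = 50, sd = 10)
y <- rbinom(1000, 1, plogis(-2 + 0.05 * x))

b_orig   <- coef(glm(y ~ x,        family = binomial))["x"]
b_scaled <- coef(glm(y ~ scale(x), family = binomial))["scale(x)"]

# dividing the scaled slope by sd(x) recovers the per-unit slope
c(rescaled = unname(b_scaled / sd(x)), original = unname(b_orig))  # essentially identical
```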

FACTOR VARIABLES

Exponentiated factor-variable coefficients give the odds of supporting the right relative to the base group of that factor. Below we identify the base level of each factor.

To assist with interpretation of factor outputs:

Here is a function to identify all the base levels in each factor group. Useful to refer to during analysis of different logistic models.

extract_base_groups <- function(df) {

  # extract factor vars only and list to print values
  factor_vars <- names(final_data)[sapply(final_data, is.factor)]
  base_groups <- list()
  
  # loop through values to check all levels
  for (var in factor_vars) {
    if (length(levels(df[[var]])) > 0) { # Check step - should be TRUE for all selected. 
      contrast_matrix <- contrasts(df[[var]])
      if (is.null(contrast_matrix)) {
        base_groups[[var]] <- levels(df[[var]])[1] # check only - shouldn't ever run
      } else {
        base_group_level <- levels(df[[var]])[rowSums(abs(contrast_matrix)) == 0] # base group is the one with all zeroes, so we extract that
        base_groups[[var]] <- base_group_level
      }
    } else {
      base_groups[[var]] <- NA # Indicate non-factor or factor with no levels.
    }
  }
  return(base_groups)
}

# run function with 'final data' and print a tidy output
factor_bases <- extract_base_groups(final_data)

factor_bases <- as_tibble(factor_bases) |> 
  pivot_longer(cols = everything(), 
               names_to = "vars",
               values_to = "base_group")

print(factor_bases, n=Inf)
## # A tibble: 23 × 2
##    vars                base_group                    
##    <chr>               <chr>                         
##  1 gender              Man                           
##  2 community           Rural area or village         
##  3 marital_status      (Re-)Married (1-4 in d7)      
##  4 occupation          Self-employed (5 to 9 in d15a)
##  5 social_class        The working class of society  
##  6 religion            Catholic                      
##  7 nonEU_national      0                             
##  8 phone_access        Mobile only                   
##  9 bill_issues         Most of the time              
## 10 internet_use        Everyday/Almost everyday      
## 11 life_sat            Very satisfied                
## 12 polintr             Strong                        
## 13 left_right          (1 - 4) Left                  
## 14 ethnic_minority     0                             
## 15 skincolor_minority  0                             
## 16 religious_minority  0                             
## 17 roma_minority       0                             
## 18 sexual_minority     0                             
## 19 disability_minority 0                             
## 20 suffered_discr      0                             
## 21 gender_docs         Yes                           
## 22 friends_trans       Yes                           
## 23 trans_docs          Yes
Model performance - Simple logistic

Here, we test the model's ability to classify people as supporting or not supporting the right.

# AIC of model
simple_logistic$aic
## [1] 19544.18
# predicted probabilities of being a 1 (i.e. supporting the possibility of trans doc changes)
predicted_probs <- simple_logistic$fitted.values
# same by running
# predicted_probs <- predict(simple_logistic, type = "response")
head(predicted_probs)
##         1         2         3         4         5         6 
## 0.3463520 0.4321645 0.5170989 0.5799662 0.9141149 0.5131253
# AUC of the model
auc(datalr$trans_docs, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8937
# Turning them to classes using custom threshold
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
head(predicted_classes)
## 1 2 3 4 5 6 
## 0 0 1 1 1 1

Get the confusion matrix:

cM <- confusionMatrix(as.factor(datalr$trans_docs), as.factor(predicted_classes))
cM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  7608  2087
##          1  2246 12217
##                                           
##                Accuracy : 0.8206          
##                  95% CI : (0.8157, 0.8255)
##     No Information Rate : 0.5921          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6277          
##                                           
##  Mcnemar's Test P-Value : 0.01638         
##                                           
##             Sensitivity : 0.7721          
##             Specificity : 0.8541          
##          Pos Pred Value : 0.7847          
##          Neg Pred Value : 0.8447          
##              Prevalence : 0.4079          
##          Detection Rate : 0.3149          
##    Detection Prevalence : 0.4013          
##       Balanced Accuracy : 0.8131          
##                                           
##        'Positive' Class : 0               
## 

The simple model predicted people's responses with 82% accuracy, which is quite decent for a basic first model and shows the response is quite explainable with the data we have. The model is slightly better at predicting those who do not support than those who do, as seen in the higher specificity than sensitivity (85% vs 77%). The two are nonetheless comparable, which confirms we do not have a severe class imbalance.

Kappa is essentially an accuracy score that also accounts for the possibility of being right by chance. A value of around 0.63 is considered quite good.
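The reported Kappa can be reproduced by hand from the confusion matrix cells:

```r
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
cm <- matrix(c(7608, 2246, 2087, 12217), nrow = 2)  # cells from the matrix above
n  <- sum(cm)
po <- sum(diag(cm)) / n                       # observed accuracy, ~0.821
pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
(po - pe) / (1 - pe)                          # ~0.628, matching the reported Kappa
```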

plot.roc(datalr$trans_docs, predicted_probs, 
         col="darkblue", 
         print.auc = TRUE,  
         auc.polygon=TRUE,
         max.auc.polygon=TRUE,
         auc.polygon.col="lightblue", 
         print.thres="best")

This suggests changing the threshold for assigning classes to 0.59.
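A sketch of how that reclassification would work, using `pROC::coords` to extract the "best" (Youden-optimal) threshold — shown here on synthetic scores, not the survey data:

```r
library(pROC)

set.seed(1)
y <- rbinom(500, 1, 0.5)
p <- plogis(2 * y - 1 + rnorm(500))  # noisy probabilities correlated with y

roc_obj <- roc(y, p, quiet = TRUE)
best    <- coords(roc_obj, "best", ret = "threshold")  # ROC-optimal cutoff

# reclassify at the new cutoff and check accuracy
classes <- ifelse(p > best$threshold, 1, 0)
mean(classes == y)
```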

In any case, we are not really interested in the predictive power of this model; we are more interested in understanding which variables are relevant.

Stepwise regression

Stepwise selection chooses the best simple logistic model based on the lowest achievable AIC.

set.seed(123)
stepAIC <- step(simple_logistic, direction = "both", trace=0)
stepAIC
## 
## Call:  glm(formula = trans_docs ~ age + gender + marital_status + occupation + 
##     religion + phone_access + bill_issues + internet_use + life_sat + 
##     left_right + ethnic_minority + roma_minority + suffered_discr + 
##     gender_docs + friends_trans + n_friends_minorities + n_actions_against_discri + 
##     antilgbtq_rights + minority_discri, family = binomial(link = "logit"), 
##     data = datalr)
## 
## Coefficients:
##                                          (Intercept)  
##                                              2.62467  
##                                                  age  
##                                              0.20698  
##                                          genderWoman  
##                                              0.26440  
## marital_statusSingle living with partner (5-8 in d7)  
##                                              0.04042  
##                    marital_statusSingle (9-10 in d7)  
##                                              0.13987  
##    marital_statusDivorced or separated (11-12 in d7)  
##                                              0.10517  
##                    marital_statusWidow (13-14 in d7)  
##                                             -0.10979  
##                occupationManagers (10 to 12 in d15a)  
##                                              0.01744  
##     occupationOther white collars (13 or 14 in d15a)  
##                                             -0.25937  
##          occupationManual workers (15 to 18 in d15a)  
##                                             -0.14375  
##                  occupationHouse persons (1 in d15a)  
##                                              0.05889  
##                     occupationUnemployed (3 in d15a)  
##                                             -0.12589  
##                        occupationRetired (4 in d15a)  
##                                             -0.30665  
##                       occupationStudents (2 in d15a)  
##                                             -0.29364  
##                           religionOrthodox Christian  
##                                             -0.12432  
##                                   religionProtestant  
##                                              0.08546  
##                              religionOther Christian  
##                                             -0.13836  
##                                        religionOther  
##                                              0.06244  
##                                       religionMuslim  
##                                             -0.24374  
##                                religionNon-believers  
##                                              0.13104  
##                            phone_accessLandline only  
##                                             -0.09735  
##                        phone_accessLandline & mobile  
##                                              0.27311  
##                             phone_accessNo telephone  
##                                             -0.19336  
##                         bill_issuesFrom time to time  
##                                             -0.34079  
##                        bill_issuesAlmost never/never  
##                                             -0.22203  
##                internet_useTwo or three times a week  
##                                             -0.41597  
##                        internet_useAbout once a week  
##                                             -0.31328  
##               internet_useTwo or three times a month  
##                                             -0.09201  
##                               internet_useLess often  
##                                              0.02229  
##                          internet_useNever/No access  
##                                             -0.33028  
##                internet_useNo Internet access at all  
##                                             -0.02332  
##                             life_satFairly satisfied  
##                                              0.05142  
##                           life_satNot very satisfied  
##                                             -0.14512  
##                         life_satNot at all satisfied  
##                                             -0.38741  
##                             left_right(5 - 6) Centre  
##                                              0.05631  
##                              left_right(7 -10) Right  
##                                             -0.17298  
##                                     ethnic_minority1  
##                                             -0.26298  
##                                       roma_minority1  
##                                             -0.58673  
##                                      suffered_discr1  
##                                             -0.22381  
##                                        gender_docsNo  
##                                             -2.65983  
##                                      friends_transNo  
##                                             -0.13252  
##                   friends_transRefusal (SPONTANEOUS)  
##                                             -0.79810  
##                                 n_friends_minorities  
##                                              0.15912  
##                             n_actions_against_discri  
##                                              0.03980  
##                                     antilgbtq_rights  
##                                             -0.69753  
##                                      minority_discri  
##                                             -0.23593  
## 
## Degrees of Freedom: 24157 Total (i.e. Null);  24112 Residual
## Null Deviance:       32540 
## Residual Deviance: 19430     AIC: 19530

The stepwise regression keeps 19 of the 29 variables in its best model. This is quite consistent with the significant variables found above, so we will not provide further commentary. The ANOVA below shows that the full model does not fit significantly better than this reduced one, which is therefore preferable: it is simpler and shows which individual-level variables may be most important overall.

anova(stepAIC, simple_logistic)
## Analysis of Deviance Table
## 
## Model 1: trans_docs ~ age + gender + marital_status + occupation + religion + 
##     phone_access + bill_issues + internet_use + life_sat + left_right + 
##     ethnic_minority + roma_minority + suffered_discr + gender_docs + 
##     friends_trans + n_friends_minorities + n_actions_against_discri + 
##     antilgbtq_rights + minority_discri
## Model 2: trans_docs ~ age + gender + years_edu + community + marital_status + 
##     occupation + social_class + religion + nonEU_national + phone_access + 
##     bill_issues + internet_use + life_sat + polintr + left_right + 
##     social_alienation + ethnic_minority + skincolor_minority + 
##     religious_minority + roma_minority + sexual_minority + disability_minority + 
##     suffered_discr + gender_docs + friends_trans + n_friends_minorities + 
##     n_actions_against_discri + antilgbtq_rights + minority_discri
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1     24112      19434                     
## 2     24096      19420 16   14.121   0.5897
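The p-value in the deviance table follows from a chi-squared test on the deviance difference:

```r
# difference in residual deviance is chi-squared under the null,
# with df equal to the difference in the number of parameters
pchisq(14.121, df = 16, lower.tail = FALSE)  # ~0.59, the Pr(>Chi) reported above
```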

Lasso regularization

We still have too many variables for a mixed model, so here we use Lasso regularization to keep only the most important variables for the mixed models.

Without this step, the glmer stage is too computationally expensive.

lasso_data <- final_data |> 
  # Doing same preprocessing as done for simple logistic
  drop_na() |> 
  select(-"country") |> 
  mutate(across(where(is.numeric), scale)) |> 
  mutate(trans_docs = as.numeric(trans_docs == "Yes")) 

# Dummifying all levels to see whether some levels are particularly important
lasso_data <- lasso_data %>% 
  dummy_cols(select_columns = names(.)[sapply(., is.factor)],
             remove_selected_columns = TRUE)

# Dropping one dummy to avoid multicollinearity
# I prefer to do this manually so as to be able to choose the baseline category
lasso_data <- lasso_data |> 
  select(-c(gender_Man, `community_Large town`,
            `marital_status_Single (9-10 in d7)`, 
            `occupation_Manual workers (15 to 18 in d15a)`, 
            `social_class_The middle class of society`, 
            religion_Catholic, 
            `phone_access_Landline & mobile`, 
            `bill_issues_From time to time`, 
            `internet_use_Everyday/Almost everyday`, 
            `life_sat_Fairly satisfied`, 
            polintr_Medium,
            `left_right_(5 - 6) Centre`,
            ethnic_minority_0,
            skincolor_minority_0,
            religious_minority_0,
            roma_minority_0,
            sexual_minority_0,
            disability_minority_0,
            suffered_discr_0,
            friends_trans_No))

# Convert data into a matrix for glmnet
x_lasso <- lasso_data |> select(-trans_docs) |> as.matrix() 
y_lasso <- lasso_data |> select(trans_docs) |> as.matrix()

set.seed(123)

# Define the lambda grid
lambda_seq <- 10^seq(-2, -5, length.out = 100)

# Fit the Lasso model (alpha = 1 for Lasso regularization)
lasso_model <- cv.glmnet(x_lasso, y_lasso, family = "binomial", alpha = 1, lambda = lambda_seq)

lasso_model$lambda.1se # The largest lambda whose CV error is within one standard error of the minimum.
## [1] 0.006579332
# This results in a simpler model with fewer selected features.

# Get all the coefficients
coefficients <- coef(lasso_model, s = "lambda.1se")
# Filter for the non-zero ones
non_zero_features <- rownames(coefficients)[which(coefficients != 0)]
non_zero_features
##  [1] "(Intercept)"                           
##  [2] "age"                                   
##  [3] "n_friends_minorities"                  
##  [4] "antilgbtq_rights"                      
##  [5] "minority_discri"                       
##  [6] "gender_Woman"                          
##  [7] "occupation_Managers (10 to 12 in d15a)"
##  [8] "religion_Orthodox Christian"           
##  [9] "religion_Non-believers"                
## [10] "phone_access_Mobile only"              
## [11] "phone_access_Landline only"            
## [12] "phone_access_No telephone"             
## [13] "bill_issues_Almost never/never"        
## [14] "internet_use_Two or three times a week"
## [15] "internet_use_Never/No access"          
## [16] "life_sat_Not very satisfied"           
## [17] "life_sat_Not at all satisfied"         
## [18] "left_right_(7 -10) Right"              
## [19] "ethnic_minority_1"                     
## [20] "roma_minority_1"                       
## [21] "suffered_discr_1"                      
## [22] "gender_docs_Yes"                       
## [23] "friends_trans_Refusal (SPONTANEOUS)"
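
The cross-validation curve behind the lambda.1se choice can be inspected directly. A sketch on simulated data (the report's lasso_model supports the same calls):

```r
# Visualising cv.glmnet's lambda choice (simulated data for illustration)
library(glmnet)
set.seed(123)
x <- matrix(rnorm(500 * 10), 500, 10)
y <- rbinom(500, 1, plogis(x[, 1] - x[, 2]))  # only 2 truly active features
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
plot(cvfit)  # binomial deviance vs log(lambda), lambda.min and lambda.1se marked
coef(cvfit, s = "lambda.1se")  # sparser than s = "lambda.min"
```

lambda.1se trades a small amount of cross-validated error for a sparser, more interpretable model than lambda.min.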

Lasso includes variables similar to what was suggested by our initial logistic regression and the subsequent stepwise.

In particular we can see again that:

  1. age is relevant

  2. It matters whether you have friends that belong to minorities (n_friends_minorities)

  3. Also antilgbtq_rights is clearly relevant

  4. It matters how discriminatory you are toward minorities (minority_discri)

  5. gender is relevant

  6. It matters whether you are a manager (at least compared to the baseline of being a manual worker)

  7. It matters whether you are an Orthodox Christian (at least compared to the baseline of being Catholic)

  8. phone_access has three relevant levels compared with the baseline (having both landline and mobile phone access)

  9. Not being poor seems to matter (bill_issues_Almost never/never compared to a baseline of having bill issues from time to time)

  10. Not using internet seems to matter (compared to accessing it every day)

  11. Not being happy in life seems to matter

  12. Being right-wing rather than center seems to matter

  13. Being from an ethnic minority seems to matter

  14. Having suffered discrimination seems to matter

  15. Supporting a third-gender option on civil documents seems to matter

  16. Refusing to answer the question on whether you know a transgender person seems to matter

Final variable selection

From these hints above and general intuition, we will select the following variables as our individual level fixed effects in the mixed level model:

  • age

  • gender

  • religion: given that most people in our sample are Catholic, while the most informative levels seem to be Non-believers, Muslim and Orthodox Christian, we will use dummies for these 3 levels instead of the full factor

  • occupation: it is not clear which levels might be relevant so we will not include any of them

  • Place you live does not seem to matter (hence no community variables in our final model)

  • marital_status: it is not clear which levels are important, so we drop this variable

  • social_class: our descriptive analysis suggested this might be relevant and, given that we are not using occupation, we will include all levels of social_class

  • phone_access seems like it might be important, but it is hard to interpret and unclear what it proxies, so we exclude it

  • bill_issues will be recoded as a dummy with Almost never/never as the reference level. Those coded 1 experience difficulties paying their bills on a sporadic or constant basis, which can easily be interpreted as a proxy for being poor

  • internet_use: following the same logic, this variable will be recoded as a dummy for whether one uses the internet every day or not

  • life_sat: people who are not satisfied in life (Not very satisfied & Not at all satisfied) seem to be really different from the rest. So a dummy will be created using those 2 levels

  • left_right: right-wing people seem significantly different from centre or left-wing respondents, so a dummy will be created for right-wing people

  • ethnic_minority and roma_minority seem quite important. Given that Roma is an ethnic minority we will keep only the former and discard the latter, which would be redundant information

  • Whether you have suffered discrimination (suffered_discr) matters

  • n_friends_minorities and friends_trans capture very similar concepts. n_friends_minorities is derived by summing the “yes” answers to “do you know a person who belongs to minority x”, excluding the transgender minority, whose answers remain a standalone variable (friends_trans). We will use friends_trans directly, as it relates more closely to our target, and discard n_friends_minorities, which likely captures a similar effect

  • antilgbtq_rights is clearly a good predictor. As gender_docs again captures a similar concept we are going to use only antilgbtq_rights

  • minority_discri is also closely related to the previous point, but we will keep it as a more general score of how discriminatory a respondent is

And these will be our pre-defined fixed effects.

# Preparing the dataframe
complete_df <- complete_df |> 
  # Irrelevant variables
  select(-c(years_edu, community, marital_status, occupation,
            nonEU_national, phone_access, polintr, social_alienation,
            gender_docs, n_friends_minorities, 
            n_actions_against_discri, skincolor_minority,
            religious_minority, roma_minority, sexual_minority,
            disability_minority)) |> 
  # Religion
  mutate(non_believer = ifelse(religion == "Non-believers", 1, 0),
    muslim = ifelse(religion == "Muslim", 1, 0),
    orthodox = ifelse(religion == "Orthodox Christian", 1, 0)) |>
  select(-religion) |> 
  # Bill issues
  mutate(poor = ifelse(bill_issues != "Almost never/never", 1, 0)) |> 
  select(-bill_issues) |> 
  # Internet use
  mutate(everyday_internet = ifelse(internet_use == "Everyday/Almost everyday", 1, 0))  |> 
  select(-internet_use) |> 
  # Life satisfaction
  mutate(unhappy = ifelse(life_sat %in% c("Not very satisfied", "Not at all satisfied"), 1, 0)) |> 
  select(-life_sat) |> 
  # Ideology
  mutate(right_wing = ifelse(left_right == "(7 -10) Right", 1, 0)) |> 
  select(-left_right)
str(complete_df)
## 'data.frame':    27438 obs. of  21 variables:
##  $ country                : chr  "BE" "BE" "BE" "BE" ...
##  $ age                    : num  51 62 38 29 63 41 48 88 44 45 ...
##  $ gender                 : Factor w/ 2 levels "Man","Woman": 1 2 2 1 2 2 1 1 2 2 ...
##  $ social_class           : Factor w/ 5 levels "The working class of society",..: 2 3 2 4 4 3 3 3 3 4 ...
##  $ ethnic_minority        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "ARE YOU PART OF X MINORITY"
##  $ suffered_discr         : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "HAVE YOU BEEN SUBJECT TO DISCRIMINATION"
##  $ friends_trans          : Factor w/ 3 levels "Yes","No","Refusal (SPONTANEOUS)": 2 2 2 2 2 2 2 2 2 2 ...
##   ..- attr(*, "label")= chr "CONTACT: TRANSGENDER/TRANSSEXUAL"
##  $ antilgbtq_rights       : num  2 2.33 1.33 1.33 2.67 1.33 1.33 2.33 1.67 1 ...
##   ..- attr(*, "label")= chr "HIGHER -> THEY OPPOSE MORE RIGHTS TO LGBTQ"
##  $ minority_discri        : num  3.38 3.75 3.38 1 3.12 ...
##  $ trans_docs             : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 2 1 1 ...
##   ..- attr(*, "label")= chr "TRANSGENDER - CHANGE CIVIL DOCUMENTS"
##  $ gdp_pc_ppp             : num  60452 60452 60452 60452 60452 ...
##  $ gender_inequality_index: num  0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 0.048 ...
##  $ lgbt_policy_index      : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ democracy_index        : num  76.4 76.4 76.4 76.4 76.4 76.4 76.4 76.4 76.4 76.4 ...
##  $ non_believer           : num  0 0 1 1 0 1 1 0 0 0 ...
##  $ muslim                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ orthodox               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ poor                   : num  1 0 1 0 0 0 1 1 0 0 ...
##  $ everyday_internet      : num  1 1 1 1 1 1 1 0 1 1 ...
##  $ unhappy                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ right_wing             : num  1 1 0 1 1 1 0 1 0 1 ...

Mixed model

Random Intercept Model:

Allows the baseline level (intercept) to vary across groups. Assumes the relationship between predictors and outcome (slope) is the same for all groups

In R syntax: y ~ x + (1|group)

Example: Different schools might have different average test scores (random intercepts), but the effect of study hours on test scores is the same across all schools


Random Slope Model:

Allows the effect of a predictor (slope) to vary across groups. Can be used with or without random intercepts.

In R syntax: y ~ x + (0+x|group) for random slope only, or y ~ x + (1+x|group) for both random intercept and slope

Example: The effect of study hours on test scores might be stronger in some schools than others (random slopes)

Random intercept models account for different baselines between groups, while random slope models account for different relationships between predictors and outcomes across groups. When both random intercepts and slopes are included, you’re allowing both the baseline and the effect of predictors to vary by group.


Interactions between group-level (higher-level) variables and individual-level (lower-level) variables in mixed models are called cross-level interactions.

Cross-level interactions are particularly useful in multilevel research because they help you understand how the relationship between an individual-level predictor and the outcome varies as a function of a group-level characteristic.

For example, in educational research:

  1. Individual level: student characteristics (study time, prior knowledge).

  2. Group level: school or classroom characteristics (class size, teaching method).

  3. Cross-level interaction: Does the effect of study time on performance depend on class size?
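
In lme4 syntax, a cross-level interaction is simply a product term between a level-1 and a level-2 predictor. A minimal sketch on simulated data (variable names are hypothetical, mirroring the schools example above, not the survey):

```r
# Cross-level interaction sketch: does the group-level predictor z
# moderate the individual-level slope of x?
library(lme4)
set.seed(42)
g  <- rep(1:30, each = 20)           # 30 groups ("schools"), 20 obs each
z  <- rnorm(30)[g]                   # group-level predictor ("class size")
x  <- rnorm(600)                     # individual-level predictor ("study time")
u0 <- rnorm(30, sd = 0.5)[g]         # random intercepts per group
y  <- 1 + 0.5 * x + 0.3 * z - 0.2 * x * z + u0 + rnorm(600)
m  <- lmer(y ~ x * z + (1 + x | g))  # x:z is the cross-level interaction
fixef(m)["x:z"]                      # estimate should be near the true -0.2
```

Note that z is constant within each group, so x:z tells us how the slope of x shifts as the group-level characteristic changes.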

Running the glmer models

To model, we will follow: Approach to multilevel model building based on Hox (2010)

1/ Null Model (Random Intercept only)

2/ Add independent Level 1 variables

3/ Add independent Level 2 variables

4/ Add random slopes

5/ (Cross-level) interactions

At each step, we check whether the model significantly improves on the previous one.

Adjusted ICC: The proportion of variance attributable to the grouping structure (random effects) after the variance explained by the fixed effects has been removed from the total; it relates to the random effects only.

Unadjusted ICC: The proportion of total variance, including the variance explained by the fixed effects, that is attributable to the grouping structure (random effects).

Null mixed model - country level random effects only

complete_df <- complete_df |> 
  drop_na() |> # we have NAs only in our target
  mutate(trans_docs = as.numeric(trans_docs == "Yes"),
         ethnic_minority = as.numeric(ethnic_minority) - 1,
         suffered_discr = as.numeric(suffered_discr) - 1) 

# Scaling the continuous variables is necessary, otherwise the model fails to converge
# However we lose some interpretability of the raw coefficients
scaled_df <- complete_df |>
  mutate(across(c(age, antilgbtq_rights, minority_discri, gdp_pc_ppp,
                  gender_inequality_index, lgbt_policy_index, 
                  democracy_index), ~scale(.x, center = TRUE, scale = TRUE)))
# Null model random intercept only
glmer_step1 <- glmer(trans_docs ~ (1|country),
                          data = scaled_df, 
                          family=binomial(link="logit"))
# Printing coefficients
nice_table(broom.mixed::tidy(glmer_step1))

effect    group    term             estimate  std.error  statistic  p
fixed              (Intercept)          0.45       0.19       2.41  .016*
ran_pars  country  sd__(Intercept)      0.98

# Random intercepts coefficients
ranef(glmer_step1)
## $country
##     (Intercept)
## AT  0.020782239
## BE  0.531337356
## BG -2.038432588
## CY -0.290724298
## CZ -0.658351324
## DE  0.875561535
## DK  0.910532764
## EE  0.009170019
## ES  1.763206743
## FI  0.443774385
## FR  0.870511332
## GB  0.402502267
## GR -0.051877595
## HR -0.631306153
## HU -2.093833734
## IE  0.554181536
## IT -0.539249991
## LT -0.741686697
## LU  0.854896968
## LV -0.510739696
## MT  1.282868516
## NL  1.406610447
## PL -0.447199501
## PT  0.782044709
## RO -1.599375278
## SE  0.500727881
## SI -0.334788326
## SK -1.293857996
## 
## with conditional variances for "country"
# AIC
AIC(glmer_step1)
## [1] 28212.65
# AUC
predicted_probs <- predict(glmer_step1, type = "response")
auc(scaled_df$trans_docs, predicted_probs)
## Area under the curve: 0.7405
# ICC
performance::icc(glmer_step1)
## # Intraclass Correlation Coefficient
## 
##     Adjusted ICC: 0.227
##   Unadjusted ICC: 0.227
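
This value can be hand-checked: for a logistic mixed model, the level-1 residual variance on the latent scale is fixed at pi^2/3 ≈ 3.29, so the null model's ICC follows directly from the country intercept SD reported above.

```r
# Hand-computing the latent-scale ICC of the null logistic model
sd_country <- 0.98                                   # sd__(Intercept) from above
icc_manual <- sd_country^2 / (sd_country^2 + pi^2 / 3)
round(icc_manual, 3)  # 0.226, matching performance::icc() up to rounding
```

(Adjusted and unadjusted ICC coincide here because the null model has no fixed effects beyond the intercept.)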

Here the AUC measures the model's ability to predict trans_docs from country-to-country differences alone. An AUC of 0.7405 shows that country membership by itself predicts better than chance (0.5).

The ICC represents the proportion of variance in trans_docs attributable to differences between countries. A value of 0.227 shows that country is a significant grouping variable, but individual-level variables must be added to build a properly explanatory model.

The AIC (28212.65) serves as a baseline for later model comparisons: if the AIC of subsequent models is lower, they are improvements.

Mixed model 1 - adding in individual-level fixed effects

# Random intercept + individual fixed effects
glmer_step2 <- glmer(trans_docs ~ age + gender + social_class +
                       ethnic_minority + suffered_discr + 
                       friends_trans + antilgbtq_rights +
                       minority_discri + non_believer + muslim +
                       orthodox + poor + everyday_internet + 
                       unhappy + right_wing + (1|country),
                     data = scaled_df,
                     family = binomial(link="logit"))
# Printing coefficients
nice_table(broom.mixed::tidy(glmer_step2))

effect    group    term                                           estimate  std.error  statistic  p
fixed              (Intercept)                                        0.57       0.15       3.83  < .001***
fixed              age                                                0.09       0.02       4.50  < .001***
fixed              genderWoman                                        0.30       0.03       9.09  < .001***
fixed              social_classThe lower middle class of society      0.02       0.05       0.47  .638
fixed              social_classThe middle class of society            0.05       0.04       1.21  .228
fixed              social_classThe upper middle class of society     -0.10       0.07      -1.35  .177
fixed              social_classThe higher class of society            0.40       0.23       1.76  .078
fixed              ethnic_minority                                   -0.26       0.09      -2.76  .006**
fixed              suffered_discr                                    -0.06       0.05      -1.41  .159
fixed              friends_transNo                                   -0.48       0.06      -7.71  < .001***
fixed              friends_transRefusal (SPONTANEOUS)                -0.81       0.16      -5.05  < .001***
fixed              antilgbtq_rights                                  -0.93       0.02     -41.41  < .001***
fixed              minority_discri                                   -0.33       0.02     -16.02  < .001***
fixed              non_believer                                       0.16       0.05       3.43  .001***
fixed              muslim                                            -0.11       0.14      -0.77  .439
fixed              orthodox                                           0.04       0.08       0.55  .585
fixed              poor                                              -0.04       0.04      -0.99  .321
fixed              everyday_internet                                  0.42       0.04       9.30  < .001***
fixed              unhappy                                           -0.18       0.05      -3.80  < .001***
fixed              right_wing                                        -0.25       0.04      -6.81  < .001***
ran_pars  country  sd__(Intercept)                                    0.66

# Random intercepts coefficients
ranef(glmer_step2)
## $country
##    (Intercept)
## AT -0.02120341
## BE  0.22958175
## BG -1.11582225
## CY  0.19511419
## CZ -0.48792517
## DE  0.37873507
## DK  0.05356456
## EE  0.58148184
## ES  1.10243054
## FI  0.13358429
## FR  0.29652703
## GB -0.50032611
## GR  0.63616507
## HR -0.23103977
## HU -1.65716237
## IE  0.02869502
## IT -0.46468719
## LT  0.07107211
## LU  0.16907657
## LV  0.33435201
## MT  1.46642982
## NL  0.23828574
## PL -0.04229342
## PT  0.99027518
## RO -0.84670206
## SE -0.75219620
## SI -0.31388036
## SK -0.49254598
## 
## with conditional variances for "country"
# AIC
AIC(glmer_step2)
## [1] 23612.7
# AUC
predicted_probs <- predict(glmer_step2, type = "response")
auc(scaled_df$trans_docs, predicted_probs)
## Area under the curve: 0.8389
# Turning probabilities into classes using a 0.5 threshold
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
# ICC
performance::icc(glmer_step2)
## # Intraclass Correlation Coefficient
## 
##     Adjusted ICC: 0.118
##   Unadjusted ICC: 0.080
# Model comparison
anova(glmer_step1, glmer_step2)
## Data: scaled_df
## Models:
## glmer_step1: trans_docs ~ (1 | country)
## glmer_step2: trans_docs ~ age + gender + social_class + ethnic_minority + suffered_discr + friends_trans + antilgbtq_rights + minority_discri + non_believer + muslim + orthodox + poor + everyday_internet + unhappy + right_wing + (1 | country)
##             npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## glmer_step1    2 28213 28229 -14104    28209                         
## glmer_step2   21 23613 23783 -11785    23571 4637.9 19  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After adding the individual-level fixed effects, we see:

The model improves: the AIC drops substantially (23613) and the AUC rises to about 0.84, showing real classification power.

The ICC has decreased, indicating that the individual-level variables now explain some of the variance previously absorbed by the country intercepts.

Mixed model 2 - adding in country-level fixed effects

# Random intercept + individual fixed effects + country fixed effects
glmer_step3 <- glmer(trans_docs ~ age + gender + social_class +
                       ethnic_minority + suffered_discr + 
                       friends_trans + antilgbtq_rights +
                       minority_discri + non_believer + muslim +
                       orthodox + poor + everyday_internet + 
                       unhappy + right_wing + gdp_pc_ppp +
                       lgbt_policy_index + gender_inequality_index +
                       democracy_index + (1|country),
                     data = scaled_df,
                     family = binomial(link="logit"))
# Printing coefficients
nice_table(broom.mixed::tidy(glmer_step3))

effect    group    term                                           estimate  std.error  statistic  p
fixed              (Intercept)                                        0.58       0.14       4.21  < .001***
fixed              age                                                0.09       0.02       4.41  < .001***
fixed              genderWoman                                        0.30       0.03       9.12  < .001***
fixed              social_classThe lower middle class of society      0.03       0.05       0.48  .629
fixed              social_classThe middle class of society            0.05       0.04       1.22  .221
fixed              social_classThe upper middle class of society     -0.10       0.07      -1.36  .173
fixed              social_classThe higher class of society            0.40       0.23       1.77  .077
fixed              ethnic_minority                                   -0.26       0.09      -2.77  .006**
fixed              suffered_discr                                    -0.07       0.05      -1.44  .151
fixed              friends_transNo                                   -0.47       0.06      -7.68  < .001***
fixed              friends_transRefusal (SPONTANEOUS)                -0.81       0.16      -5.04  < .001***
fixed              antilgbtq_rights                                  -0.92       0.02     -41.25  < .001***
fixed              minority_discri                                   -0.32       0.02     -15.97  < .001***
fixed              non_believer                                       0.15       0.05       3.41  .001***
fixed              muslim                                            -0.11       0.14      -0.79  .432
fixed              orthodox                                           0.06       0.08       0.76  .449
fixed              poor                                              -0.04       0.04      -0.95  .342
fixed              everyday_internet                                  0.42       0.04       9.27  < .001***
fixed              unhappy                                           -0.18       0.05      -3.76  < .001***
fixed              right_wing                                        -0.25       0.04      -6.81  < .001***
fixed              gdp_pc_ppp                                        -0.04       0.14      -0.28  .779
fixed              lgbt_policy_index                                  0.22       0.16       1.35  .177
fixed              gender_inequality_index                           -0.11       0.16      -0.65  .514
fixed              democracy_index                                    0.01       0.22       0.06  .948
ran_pars  country  sd__(Intercept)                                    0.59

# Random intercepts coefficients
ranef(glmer_step3)
## $country
##    (Intercept)
## AT -0.18773313
## BE -0.02997742
## BG -0.77924150
## CY  0.67568621
## CZ -0.36217360
## DE  0.18443755
## DK -0.26115941
## EE  0.60416465
## ES  0.78913426
## FI -0.25378560
## FR  0.09223640
## GB -0.76869320
## GR  0.68310163
## HR -0.16486445
## HU -1.29100958
## IE  0.03264299
## IT -0.36311416
## LT  0.36893641
## LU  0.21177553
## LV  0.78404808
## MT  1.13091409
## NL -0.13537367
## PL  0.25821682
## PT  0.72874926
## RO -0.44017346
## SE -1.15291525
## SI -0.31158359
## SK -0.06293617
## 
## with conditional variances for "country"
# AIC
AIC(glmer_step3)
## [1] 23614.63
# AUC
predicted_probs <- predict(glmer_step3, type = "response")
auc(scaled_df$trans_docs, predicted_probs)
## Area under the curve: 0.8389
# ICC
performance::icc(glmer_step3)
## # Intraclass Correlation Coefficient
## 
##     Adjusted ICC: 0.097
##   Unadjusted ICC: 0.060
# Model comparison
anova(glmer_step1, glmer_step2, glmer_step3)
## Data: scaled_df
## Models:
## glmer_step1: trans_docs ~ (1 | country)
## glmer_step2: trans_docs ~ age + gender + social_class + ethnic_minority + suffered_discr + friends_trans + antilgbtq_rights + minority_discri + non_believer + muslim + orthodox + poor + everyday_internet + unhappy + right_wing + (1 | country)
## glmer_step3: trans_docs ~ age + gender + social_class + ethnic_minority + suffered_discr + friends_trans + antilgbtq_rights + minority_discri + non_believer + muslim + orthodox + poor + everyday_internet + unhappy + right_wing + gdp_pc_ppp + lgbt_policy_index + gender_inequality_index + democracy_index + (1 | country)
##             npar   AIC   BIC logLik deviance    Chisq Df Pr(>Chisq)    
## glmer_step1    2 28213 28229 -14104    28209                           
## glmer_step2   21 23613 23783 -11785    23571 4637.948 19     <2e-16 ***
## glmer_step3   25 23615 23817 -11782    23565    6.069  4     0.1941    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding country-level fixed effects yields:

A model that is overfitted relative to step 2 (individual fixed effects only), though still better than the null model of step 1. It is overfitted because predictive performance does not improve compared to step 2 despite the additional variables: the AUC is equivalent while the AIC increases slightly (indicating worse overall model fit).

Mixed model 3 - adding in cross-level interactions and random slopes

This final mixed model includes cross-level interactions between individual and country level data. The following interactions are included in our models:

right_wing:lgbt_policy_index: Tests whether the effect of individual political ideology varies based on a country’s LGBT policy environment. The relationship between conservative views and trans attitudes might be weaker in countries with strong LGBT protections.

social_class:gdp_pc_ppp: Tests whether social class effects differ across countries with varying economic development.

gender:gender_inequality_index: Tests how the relationship between being female and attitudes toward transgender people varies depending on the level of gender inequality in their country.

gender:lgbt_policy_index: Tests how the relationship between an individual’s gender and their attitudes toward transgender people varies depending on the LGBTQ policy environment of their country.

age:gdp_pc_ppp: Tests how the relationship between an individual’s age and their attitudes toward transgender people varies depending on the economic development level of their country.

non_believer:democracy_index: Tests how the relationship between being a non-believer and attitudes toward transgender people varies depending on the democratic context of their country.

non_believer:gender_inequality_index: Tests how the relationship between being a non-believer and attitudes toward transgender people varies depending on the level of gender inequality in their country.

unhappy:lgbt_policy_index: Tests how the relationship between an individual’s unhappiness (or life dissatisfaction) and their attitudes toward transgender people varies depending on the LGBTQ policy environment of their country.

poor:democracy_index: Tests how the relationship between an individual’s economic status (being poor) and their attitudes toward transgender people varies depending on the democratic context of their country.

# Adding random slopes and cross level interaction
glmer_step4 <- glmer(trans_docs ~ 
                       # Individual level variables
                       age + gender + social_class +
                       ethnic_minority + suffered_discr + 
                       friends_trans + antilgbtq_rights +
                       minority_discri + non_believer + muslim +
                       orthodox + poor + everyday_internet + 
                       unhappy + right_wing + 
                       # Country level variables
                       gdp_pc_ppp + gender_inequality_index +
                       lgbt_policy_index + democracy_index + 
                       # Cross-level interactions
                       right_wing:lgbt_policy_index +
                       social_class:gdp_pc_ppp +
                       gender:gender_inequality_index +
                       gender:lgbt_policy_index +
                       age:gdp_pc_ppp +
                       non_believer:democracy_index +
                       non_believer:gender_inequality_index +
                       unhappy:lgbt_policy_index +
                       poor:democracy_index +
                       (1 + gdp_pc_ppp +
                       lgbt_policy_index + gender_inequality_index +
                       democracy_index |country),
                     data = scaled_df,
                     family = binomial(link="logit"))

Run the diagnostics:

# Printing coefficients
nice_table(broom.mixed::tidy(glmer_step4))

effect    group    term                                                        estimate  std.error  statistic  p
fixed              (Intercept)                                                     0.54       0.12       4.40  < .001***
fixed              age                                                             0.09       0.02       4.56  < .001***
fixed              genderWoman                                                     0.31       0.03       9.33  < .001***
fixed              social_classThe lower middle class of society                   0.03       0.05       0.64  .524
fixed              social_classThe middle class of society                         0.04       0.04       1.00  .316
fixed              social_classThe upper middle class of society                  -0.06       0.08      -0.78  .436
fixed              social_classThe higher class of society                         0.40       0.23       1.74  .082
fixed              ethnic_minority                                                -0.26       0.09      -2.73  .006**
fixed              suffered_discr                                                 -0.07       0.05      -1.58  .115
fixed              friends_transNo                                                -0.46       0.06      -7.51  < .001***
fixed              friends_transRefusal (SPONTANEOUS)                             -0.79       0.16      -4.94  < .001***
fixed              antilgbtq_rights                                               -0.92       0.02     -41.09  < .001***
fixed              minority_discri                                                -0.32       0.02     -15.89  < .001***
fixed              non_believer                                                    0.14       0.05       2.91  .004**
fixed              muslim                                                         -0.10       0.14      -0.74  .457
fixed              orthodox                                                        0.06       0.08       0.71  .478
fixed              poor                                                           -0.04       0.04      -1.03  .303
fixed              everyday_internet                                               0.42       0.04       9.30  < .001***
fixed              unhappy                                                        -0.18       0.05      -3.58  < .001***
fixed              right_wing                                                     -0.26       0.04      -7.02  < .001***
fixed              gdp_pc_ppp                                                     -0.07       0.12      -0.60  .552
fixed              gender_inequality_index                                         0.09       0.19       0.44  .657
fixed              lgbt_policy_index                                               0.10       0.15       0.69  .490
fixed              democracy_index                                                 0.19       0.14       1.33  .184
fixed              right_wing × lgbt_policy_index                                 -0.08       0.04      -2.14  .033*
fixed              social_classThe lower middle class of society × gdp_pc_ppp      0.05       0.06       0.81  .419
fixed              social_classThe middle class of society × gdp_pc_ppp           -0.03       0.04      -0.79  .428
fixed              social_classThe upper middle class of society × gdp_pc_ppp     -0.12       0.07      -1.88  .061
fixed              social_classThe higher class of society × gdp_pc_ppp           -0.10       0.23      -0.43  .668
fixed              genderWoman × gender_inequality_index                          -0.03       0.05      -0.66  .507
fixed              genderWoman × lgbt_policy_index                                 0.12       0.04       2.70  .007**
fixed              age × gdp_pc_ppp                                               -0.01       0.02      -0.68  .494
fixed              non_believer × democracy_index                                 -0.19       0.07      -2.82  .005**
fixed              non_believer × gender_inequality_index                         -0.25       0.08      -3.27  .001**
fixed              unhappy × lgbt_policy_index                                    -0.02       0.05      -0.34  .737
fixed              poor × democracy_index                                         -0.02       0.04      -0.49  .623
ran_pars  country  sd__(Intercept)                                                 0.37
ran_pars  country  cor__(Intercept).gdp_pc_ppp                                    -1.00
ran_pars  country  cor__(Intercept).lgbt_policy_index                              0.65
ran_pars  country  cor__(Intercept).gender_inequality_index                        0.96
ran_pars  country  cor__(Intercept).democracy_index                                1.00
ran_pars  country  sd__gdp_pc_ppp                                                  0.40
ran_pars  country  cor__gdp_pc_ppp.lgbt_policy_index                              -0.58
ran_pars  country  cor__gdp_pc_ppp.gender_inequality_index                        -0.93
ran_pars  country  cor__gdp_pc_ppp.democracy_index                                -1.00
ran_pars  country  sd__lgbt_policy_index                                           0.26
ran_pars  country  cor__lgbt_policy_index.gender_inequality_index                  0.84
ran_pars  country  cor__lgbt_policy_index.democracy_index                          0.59
ran_pars  country  sd__gender_inequality_index                                     0.49
ran_pars  country  cor__gender_inequality_index.democracy_index                    0.93
ran_pars  country  sd__democracy_index                                             0.50

# Random intercepts coefficients
ranef(glmer_step4)
## $country
##     (Intercept)   gdp_pc_ppp lgbt_policy_index gender_inequality_index
## AT -0.016719005  0.020873947       0.007983478            -0.010045127
## BE -0.318782188  0.350559139      -0.093523634            -0.368758666
## BG -0.381919533  0.407953234      -0.174391897            -0.486754944
## CY  0.113163956 -0.120796506       0.052093931             0.144530695
## CZ -0.322175700  0.348228909      -0.125916831            -0.395326763
## DE  0.176589302 -0.185921585       0.094645411             0.235166631
## DK -0.146077705  0.159026821      -0.051207615            -0.175001333
## EE  0.501492482 -0.549908689       0.155277926             0.585990872
## ES  0.660784593 -0.698108820       0.341708892             0.870999883
## FI -0.096169329  0.102463999      -0.045263945            -0.123541726
## FR  0.145330204 -0.151717571       0.084588619             0.198367977
## GB -0.271699605  0.283806515      -0.157283946            -0.370237280
## GR  0.449724581 -0.484343523       0.184825627             0.558368533
## HR -0.004704177  0.005020313      -0.002171097            -0.006012125
## HU -0.604234263  0.632083378      -0.344992960            -0.819917726
## IE -0.007134497  0.007848001      -0.002082255            -0.008245053
## IT  0.153016952 -0.150835923       0.135193338             0.242127626
## LT  0.260122286 -0.304676764      -0.020156806             0.231331613
## LU -0.168878442  0.181313656      -0.072330630            -0.211785793
## LV  0.304476982 -0.336731101       0.079468264             0.345101164
## MT  0.321079402 -0.328086018       0.223683280             0.464795636
## NL  0.040377234 -0.042326200       0.022598421             0.054461585
## PL -0.285184234  0.279540463      -0.260145318            -0.457162212
## PT  0.486296593 -0.513516336       0.252763616             0.641930301
## RO -0.333940319  0.351237330      -0.180794860            -0.446022702
## SE -0.601343420  0.640784725      -0.282614804            -0.772199797
## SI  0.058640358 -0.025229484       0.220535211             0.214468143
## SK -0.123930094  0.133900778      -0.048700512            -0.152259669
##    democracy_index
## AT    -0.025504649
## BE    -0.440241245
## BG    -0.515749395
## CY     0.152739227
## CZ    -0.439042851
## DE     0.235842678
## DK    -0.200169528
## EE     0.691038067
## ES     0.884839697
## FI    -0.129615243
## FR     0.192839634
## GB    -0.360680737
## GR     0.611161669
## HR    -0.006348189
## HU    -0.803017432
## IE    -0.009855046
## IT     0.194393354
## LT     0.377311797
## LU    -0.228952308
## LV     0.422333100
## MT     0.419145087
## NL     0.053746082
## PL    -0.360766432
## PT     0.650945997
## RO    -0.445651238
## SE    -0.810558118
## SI     0.042874169
## SK    -0.168835330
## 
## with conditional variances for "country"
# AIC
AIC(glmer_step4)
## [1] 23606.46
# AUC
predicted_probs <- predict(glmer_step4)
auc(scaled_df$trans_docs, predicted_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.8394
# ICC
performance::icc(glmer_step4)
## boundary (singular) fit: see help('isSingular')
## # Intraclass Correlation Coefficient
## 
##     Adjusted ICC: 0.094
##   Unadjusted ICC: 0.060
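For a logistic multilevel model, the ICC is computed on the latent scale, where the residual variance is fixed at pi^2/3. A minimal sketch of the underlying formula (the function name `icc_logit` and the example variance are ours, not taken from the fitted model):

```r
# Latent-scale ICC for a random-intercept logistic model:
#   ICC = tau0^2 / (tau0^2 + pi^2 / 3)
# where pi^2/3 is the residual variance of the standard logistic distribution
icc_logit <- function(tau0_sq) {
  tau0_sq / (tau0_sq + pi^2 / 3)
}

# e.g. a random-intercept variance of 0.5 implies an ICC of about 0.13
icc_logit(0.5)
```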
# Model comparison
anova(glmer_step1, glmer_step2, glmer_step3, glmer_step4)
## Data: scaled_df
## Models:
## glmer_step1: trans_docs ~ (1 | country)
## glmer_step2: trans_docs ~ age + gender + social_class + ethnic_minority + suffered_discr + friends_trans + antilgbtq_rights + minority_discri + non_believer + muslim + orthodox + poor + everyday_internet + unhappy + right_wing + (1 | country)
## glmer_step3: trans_docs ~ age + gender + social_class + ethnic_minority + suffered_discr + friends_trans + antilgbtq_rights + minority_discri + non_believer + muslim + orthodox + poor + everyday_internet + unhappy + right_wing + gdp_pc_ppp + lgbt_policy_index + gender_inequality_index + democracy_index + (1 | country)
## glmer_step4: trans_docs ~ age + gender + social_class + ethnic_minority + suffered_discr + friends_trans + antilgbtq_rights + minority_discri + non_believer + muslim + orthodox + poor + everyday_internet + unhappy + right_wing + gdp_pc_ppp + gender_inequality_index + lgbt_policy_index + democracy_index + right_wing:lgbt_policy_index + social_class:gdp_pc_ppp + gender:gender_inequality_index + gender:lgbt_policy_index + age:gdp_pc_ppp + non_believer:democracy_index + non_believer:gender_inequality_index + unhappy:lgbt_policy_index +      poor:democracy_index + (1 + gdp_pc_ppp + lgbt_policy_index + gender_inequality_index + democracy_index | country)
##             npar   AIC   BIC logLik deviance    Chisq Df Pr(>Chisq)    
## glmer_step1    2 28213 28229 -14104    28209                           
## glmer_step2   21 23613 23783 -11785    23571 4637.948 19  < 2.2e-16 ***
## glmer_step3   25 23615 23817 -11782    23565    6.069  4  0.1940554    
## glmer_step4   51 23606 24019 -11752    23504   60.170 26  0.0001591 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
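The chi-square statistics in this table are simply differences in deviance between nested models. For example, the step-1 vs step-2 comparison can be reproduced by hand from the (rounded) deviances printed above:

```r
# Likelihood-ratio test of glmer_step1 vs glmer_step2, using the anova output
dev_null <- 28209   # deviance of the intercept-only model (glmer_step1)
dev_ind  <- 23571   # deviance after adding individual-level predictors
df_diff  <- 21 - 2  # difference in number of estimated parameters (npar)

lrt_stat <- dev_null - dev_ind   # 4638, matching the Chisq column (rounded)
p_value  <- pchisq(lrt_stat, df = df_diff, lower.tail = FALSE)
```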
# Class predictions
predicted_classes <- ifelse(predicted_probs > 0.5, 1, 0)
# Confusion matrix
confusionMatrix(as.factor(scaled_df$trans_docs), as.factor(predicted_classes))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  7468  2227
##          1  3433 11030
##                                          
##                Accuracy : 0.7657         
##                  95% CI : (0.7603, 0.771)
##     No Information Rate : 0.5488         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5222         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.6851         
##             Specificity : 0.8320         
##          Pos Pred Value : 0.7703         
##          Neg Pred Value : 0.7626         
##              Prevalence : 0.4512         
##          Detection Rate : 0.3091         
##    Detection Prevalence : 0.4013         
##       Balanced Accuracy : 0.7585         
##                                          
##        'Positive' Class : 0              
## 
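The headline metrics can be recomputed directly from the 2×2 table (note that caret treats 0 as the positive class here, so sensitivity is taken over the Reference = 0 column):

```r
# Cells from the confusion matrix above (rows = prediction, cols = reference)
tp <- 7468   # predicted 0, actually 0 (true positives for class "0")
fn <- 3433   # predicted 1, actually 0
fp <- 2227   # predicted 0, actually 1
tn <- 11030  # predicted 1, actually 1

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # 0.7657
sensitivity <- tp / (tp + fn)                   # 0.6851
specificity <- tn / (tn + fp)                   # 0.8320
```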
plot.roc(scaled_df$trans_docs, predicted_probs, 
         col="darkblue", 
         print.auc = TRUE,  
         auc.polygon=TRUE,
         max.auc.polygon=TRUE,
         auc.polygon.col="lightblue", 
         print.thres="best")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

This model, although computationally expensive, is a marginal improvement on the previous best model (model 1, individual fixed effects). The model fit has improved: in this full model, the AIC dropped to 23606. However, the BIC increased relative to model 1, which creates some uncertainty.

The predictive power increases only very marginally: the AUC rises from 0.8389 to 0.8394 in the full model. This may indicate that predictive power is maximised around 84%; given such minor improvements, the model with only individual-level fixed effects is preferred.

Cross-level effects

When looking at just the cross-level effects, the significant relationships and their directions were: right_wing × lgbt_policy_index (negative), genderWoman × lgbt_policy_index (positive), non_believer × democracy_index (negative), and non_believer × gender_inequality_index (negative).

These effects are interpreted as follows: the relationship between an individual-level variable and the outcome variable changes depending on the value of a country-level variable.

For right_wing × lgbt_policy_index, as the coefficient is negative, we infer that the effect of being right wing is weaker in countries with stronger LGBTI policies. So right-wing opposition to transgender documentation is milder in more progressive countries. This makes sense and may capture some of the subjectivity of political self-placement across countries.

The genderWoman × lgbt_policy_index coefficient suggests that the effect of being a woman is stronger in countries with stronger LGBTI policies. This is more difficult to interpret.

For non_believer × democracy_index and non_believer × gender_inequality_index, both coefficients are negative. This implies that the effect of being a non-believer is weaker in countries with a higher democracy score and a higher gender inequality score (as the latter is a reversed scale, the negative relationship indicates a positive correlation). This could align with more progressive democratic countries having lower levels of religiosity, stronger gender-equality policies, and higher support for our target variable.
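Formally, a cross-level interaction makes the individual-level slope a linear function of the country-level moderator: effect = b_individual + b_interaction × z. A sketch with made-up coefficients (not the fitted values from our model):

```r
# Marginal effect of an individual-level predictor at a given value z of a
# standardized country-level moderator
marginal_effect <- function(b_individual, b_interaction, z) {
  b_individual + b_interaction * z
}

# hypothetical coefficients: the slope shifts by b_interaction per unit of z
marginal_effect(b_individual = -0.5, b_interaction = -0.2, z = c(-1, 0, 1))
# -0.3 -0.5 -0.7
```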

Interpreting results

The best-fitting model is the full model with all of the individual-level, country-level and cross-level effects included. However, the gains over the previous best model are very marginal, so depending on needs, model 1 may be the better option due to its simpler interpretation and lower computational cost.

These mixed models allow us to select the appropriate level of complexity. On balance, the preferred model is model 1, with individual-level fixed effects and a random intercept for country, because it has good predictive power and does not overfit. The final model with cross-level effects is very computationally expensive and offers little improvement over the simpler model.

Key variables

The key variables in the preferred model that are significant are those listed in its fixed-effects summary above.

Model classification power

Here we compare the country-level classification rates of each model against the actual rates of support, focusing on our preferred model with individual fixed effects and a random intercept for country.

# summarise predictions
prediction_summary <- scaled_df |> 
  select(trans_docs, 
         country) |> 
  mutate(null_scaled = predict(glmer_step1),
         null_model_pred = ifelse(null_scaled > 0.5, 1, 0),
         m1_preds = predict(glmer_step2),
         m1_model_pred = ifelse(m1_preds > 0.5, 1, 0),
         m2_preds = predict(glmer_step3),
         m2_model_pred = ifelse(m2_preds > 0.5, 1, 0),
         m3_preds = predict(glmer_step4), 
         m3_model_pred = ifelse(m3_preds > 0.5, 1, 0))

# calculate rates for each model at country level and pivot longer
summary_table <- prediction_summary %>%
  select(trans_docs, country, null_model_pred, m1_model_pred, m2_model_pred, m3_model_pred) |>
  group_by(country) %>%
  summarise(
    actual_rate = mean(trans_docs)*100,
    null_rate = mean(null_model_pred)*100,
    m1_rate = mean(m1_model_pred)*100,
    m2_rate = mean(m2_model_pred)*100,
    m3_rate = mean(m3_model_pred)*100)

#plot all of the model estimates on a graph
# with unpivoted data
summary_table %>%
  mutate(
    null_model = "Null Model",
    model_1 = "Model 1 (+individual fixed)",
    model_2 = "Model 2 (+country fixed)",
    model_3 = "Model 3 (+cross-level effects)"
  ) %>%
  ggplot(aes(x = reorder(country, -actual_rate), y = actual_rate)) +
  geom_col(alpha = 0.5) +
  geom_point(aes(y = null_rate, color = null_model),
             size = 3, shape = 18,
             position = position_nudge(x = -0.3)) +
  geom_point(aes(y = m1_rate, color = model_1),
             size = 3, shape = 15,
             position = position_nudge(x = -0.1)) +
  geom_point(aes(y = m2_rate, color = model_2),
             size = 3, shape = 16,
             position = position_nudge(x = 0.1)) +
  geom_point(aes(y = m3_rate, color = model_3),
             size = 3, shape = 17,
             position = position_nudge(x = 0.3)) +
  scale_color_manual(name = "Mixed model Type",
                       values = c("Null Model" = "blue",
                                  "Model 1 (+individual fixed)" = "red",
                                  "Model 2 (+country fixed)" = "green",
                                  "Model 3 (+cross-level effects)" = "purple")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5),
        axis.title.x = element_blank())+
  scale_y_continuous(labels = percent_format(scale = 1)) +
  labs(title = "Comparing explanatory mixed models' classification power",
       y = "Proportion of support",
       x = "Countries, ordered by actual support") +
  theme_minimal()+
  theme(panel.grid = element_blank(),
        legend.position = c(0.98, 0.98),  # Adjust coordinates for top right
        legend.justification = c("right", "top"),
        legend.background = element_rect(fill = "white", color = "gray"))
## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

From this plot we can see that the three GLMM specifications are very similar. The step-2 (red) and step-3 (green) models are indistinguishable, and the step-4 cross-level model differs only slightly in country-level classification rates for some countries. All models show similar trends: they classify best the countries with a more even split of support, while countries with more extreme support or opposition pose a class-imbalance challenge for the explanatory model.

Predictive model

Developing a Predictive Model for Other Countries

  1. Summarize individual data to the country level (e.g. percentage of the population living in rural areas by country, or percentage of people from minority x by country). The target variable becomes the percentage of people in country x who support the right of transgender people to change their civil documents.

  2. Then run random forest / gradient boosting / elastic-net linear regression with leave-one-out cross-validation.

Leave-One-Out Cross-Validation (LOOCV) is a special case of k-fold cross-validation where k equals the number of observations (n). For each observation (1 out of 28), the model is trained on the remaining 27 rows and tested on the left-out row. This process repeats 28 times (once per row), and the final performance metric (e.g. RMSE) is the average of all 28 test results.
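The procedure can be sketched as an explicit loop (caret's trainControl(method = "LOOCV") below automates this). A minimal illustration, assuming a data frame whose outcome column is named y, with lm as a stand-in model:

```r
# Manual LOOCV sketch: each row is held out once, the model is refit on the
# remaining rows, and the held-out row is predicted.
loocv_rmse <- function(df, formula) {
  errors <- vapply(seq_len(nrow(df)), function(i) {
    fit  <- lm(formula, data = df[-i, ])      # train on n - 1 rows
    pred <- predict(fit, newdata = df[i, ])   # predict the left-out row
    df$y[i] - pred
  }, numeric(1))
  sqrt(mean(errors^2))                        # average error over the n folds
}

toy <- data.frame(x = 1:10, y = 2 * (1:10) + 1)
loocv_rmse(toy, y ~ x)   # ~0: a perfect linear relationship leaves no error
```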

We switched to country-level predictions because this is how we understood the task.

Given that by switching to country level data we only have 28 observations it does not make sense to split our dataframe into test and training for model comparison. Instead we will exploit the fact of having so few rows to run leave-one-out cross validation for hyperparameter tuning and to compare our models.

It is also worth noting that the generalizability of our model is very low, partly because of the few observations and partly because it is trained only on data from European Union countries, which share many similarities. The models we derive could be used to predict support for trans_docs in other European countries, but their findings are likely to be very wrong for countries outside Europe, or with a socio-economic structure completely different from that of EU countries.

Aggregating individual data to country level

We now build a dataframe with 28 rows (one per country), summarizing all our variables at the country level.

In the next steps we will synthesize the information from the individual respondents into country-level variables.

Using means for numeric variables:

# using final_data which contains individual level data only
names(select(final_data, where(is.numeric))) 
## [1] "age"                      "years_edu"               
## [3] "social_alienation"        "n_friends_minorities"    
## [5] "n_actions_against_discri" "antilgbtq_rights"        
## [7] "minority_discri"
data_num <- final_data |> 
  group_by(country) |> 
  summarise(across(where(is.numeric), ~mean(.), .names = "mean_{.col}"))

As well as for binary dummy variables:

# Searching for factor variables with two levels
dummy_names <- names(final_data)[sapply(final_data, 
                                        is.factor) & 
                                   sapply(final_data, 
                                          function(x) length(levels(x)) == 2)]
dummy_names
##  [1] "gender"              "nonEU_national"      "ethnic_minority"    
##  [4] "skincolor_minority"  "religious_minority"  "roma_minority"      
##  [7] "sexual_minority"     "disability_minority" "suffered_discr"     
## [10] "gender_docs"         "trans_docs"
dummy_levels <- lapply(final_data[dummy_names], levels)
dummy_levels
## $gender
## [1] "Man"   "Woman"
## 
## $nonEU_national
## [1] "0" "1"
## 
## $ethnic_minority
## [1] "0" "1"
## 
## $skincolor_minority
## [1] "0" "1"
## 
## $religious_minority
## [1] "0" "1"
## 
## $roma_minority
## [1] "0" "1"
## 
## $sexual_minority
## [1] "0" "1"
## 
## $disability_minority
## [1] "0" "1"
## 
## $suffered_discr
## [1] "0" "1"
## 
## $gender_docs
## [1] "Yes" "No" 
## 
## $trans_docs
## [1] "Yes" "No"
data_dummy <- final_data %>%
  select(country, all_of(dummy_names)) %>%
  mutate(across(all_of(dummy_names),
                ~ case_when(
                  # convert "0"/"1" to 0/1
                  . %in% c("0", "1") ~ as.numeric(as.character(.)),
                  # convert "Yes"/"No" to 1/0
                  . %in% c("Yes", "No") ~ recode(., "Yes" = 1, "No" = 0),
                  # convert "Man"/"Woman" to 1/0
                  . %in% c("Man", "Woman") ~ recode(., "Man" = 1, "Woman" = 0), 
                  TRUE ~ NA 
                ),
                .names = "{.col}_num"))

We convert factors with 2 levels to numerical 0-1 so we can compute their means.

# Country means
data_dummy <- data_dummy %>%
  group_by(country) %>%
  summarise(
    across(ends_with("_num"), ~ mean(.x, na.rm = TRUE)),
    .groups = "drop"
  )

# substitute the "_num" suffix with a "mean_" prefix so it has the same format as the variables in data_num
data_dummy <- data_dummy %>%
  rename_with(~ paste0("mean_", str_remove(., "_num$")), ends_with("_num"))

Now we can join the two datasets.

data_aggr <- data_num %>%
  left_join(data_dummy, by = "country")

The next variables are factor with >2 levels:

factor_names <- names(final_data)[!sapply(final_data, is.numeric) &
                                  !(sapply(final_data, is.factor) & 
                                      sapply(final_data, 
                                             function(x) length(levels(x)) == 2))]
factor_names
##  [1] "country"        "community"      "marital_status" "occupation"    
##  [5] "social_class"   "religion"       "phone_access"   "bill_issues"   
##  [9] "internet_use"   "life_sat"       "polintr"        "left_right"    
## [13] "friends_trans"
# checking the levels
lapply(final_data[factor_names], levels)
## $country
## NULL
## 
## $community
## [1] "Rural area or village"      "Small or middle sized town"
## [3] "Large town"                
## 
## $marital_status
## [1] "(Re-)Married (1-4 in d7)"              
## [2] "Single living with partner (5-8 in d7)"
## [3] "Single (9-10 in d7)"                   
## [4] "Divorced or separated (11-12 in d7)"   
## [5] "Widow (13-14 in d7)"                   
## 
## $occupation
## [1] "Self-employed (5 to 9 in d15a)"        
## [2] "Managers (10 to 12 in d15a)"           
## [3] "Other white collars (13 or 14 in d15a)"
## [4] "Manual workers (15 to 18 in d15a)"     
## [5] "House persons (1 in d15a)"             
## [6] "Unemployed (3 in d15a)"                
## [7] "Retired (4 in d15a)"                   
## [8] "Students (2 in d15a)"                  
## 
## $social_class
## [1] "The working class of society"      "The lower middle class of society"
## [3] "The middle class of society"       "The upper middle class of society"
## [5] "The higher class of society"      
## 
## $religion
## [1] "Catholic"           "Orthodox Christian" "Protestant"        
## [4] "Other Christian"    "Other"              "Muslim"            
## [7] "Non-believers"     
## 
## $phone_access
## [1] "Mobile only"       "Landline only"     "Landline & mobile"
## [4] "No telephone"     
## 
## $bill_issues
## [1] "Most of the time"   "From time to time"  "Almost never/never"
## 
## $internet_use
## [1] "Everyday/Almost everyday"   "Two or three times a week" 
## [3] "About once a week"          "Two or three times a month"
## [5] "Less often"                 "Never/No access"           
## [7] "No Internet access at all" 
## 
## $life_sat
## [1] "Very satisfied"       "Fairly satisfied"     "Not very satisfied"  
## [4] "Not at all satisfied"
## 
## $polintr
## [1] "Strong"     "Medium"     "Low"        "Not at all"
## 
## $left_right
## [1] "(1 - 4) Left"   "(5 - 6) Centre" "(7 -10) Right" 
## 
## $friends_trans
## [1] "Yes"                   "No"                    "Refusal (SPONTANEOUS)"
# removing country from the list
factor_names <- setdiff(factor_names, "country")

# creating the aggregated dataset for factor variables
data_factors <- final_data %>%
  select(country, all_of(factor_names)) %>%
  pivot_longer(cols = -country, names_to = "variable", values_to = "level") %>%
  group_by(country, variable, level) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(country, variable) %>%
  mutate(proportion = count / sum(count)) %>%
  select(-count) %>%
  pivot_wider(names_from = c(variable, level), values_from = proportion, names_glue = "{variable}_{level}")

colSums(is.na(data_factors))
##                                               country 
##                                                     0 
##                          bill_issues_Most of the time 
##                                                     0 
##                         bill_issues_From time to time 
##                                                     0 
##                        bill_issues_Almost never/never 
##                                                     0 
##                       community_Rural area or village 
##                                                     0 
##                  community_Small or middle sized town 
##                                                     0 
##                                  community_Large town 
##                                                     0 
##                                     friends_trans_Yes 
##                                                     0 
##                                      friends_trans_No 
##                                                     0 
##                   friends_trans_Refusal (SPONTANEOUS) 
##                                                     0 
##                 internet_use_Everyday/Almost everyday 
##                                                     0 
##                internet_use_Two or three times a week 
##                                                     0 
##                        internet_use_About once a week 
##                                                     1 
##               internet_use_Two or three times a month 
##                                                     0 
##                               internet_use_Less often 
##                                                     0 
##                          internet_use_Never/No access 
##                                                     0 
##                internet_use_No Internet access at all 
##                                                     0 
##                               left_right_(1 - 4) Left 
##                                                     0 
##                             left_right_(5 - 6) Centre 
##                                                     0 
##                              left_right_(7 -10) Right 
##                                                     0 
##                               life_sat_Very satisfied 
##                                                     0 
##                             life_sat_Fairly satisfied 
##                                                     0 
##                           life_sat_Not very satisfied 
##                                                     0 
##                         life_sat_Not at all satisfied 
##                                                     0 
##               marital_status_(Re-)Married (1-4 in d7) 
##                                                     0 
## marital_status_Single living with partner (5-8 in d7) 
##                                                     0 
##                    marital_status_Single (9-10 in d7) 
##                                                     0 
##    marital_status_Divorced or separated (11-12 in d7) 
##                                                     0 
##                    marital_status_Widow (13-14 in d7) 
##                                                     0 
##             occupation_Self-employed (5 to 9 in d15a) 
##                                                     0 
##                occupation_Managers (10 to 12 in d15a) 
##                                                     0 
##     occupation_Other white collars (13 or 14 in d15a) 
##                                                     0 
##          occupation_Manual workers (15 to 18 in d15a) 
##                                                     0 
##                  occupation_House persons (1 in d15a) 
##                                                     0 
##                     occupation_Unemployed (3 in d15a) 
##                                                     0 
##                        occupation_Retired (4 in d15a) 
##                                                     0 
##                       occupation_Students (2 in d15a) 
##                                                     0 
##                              phone_access_Mobile only 
##                                                     0 
##                            phone_access_Landline only 
##                                                     0 
##                        phone_access_Landline & mobile 
##                                                     0 
##                             phone_access_No telephone 
##                                                     1 
##                                        polintr_Strong 
##                                                     0 
##                                        polintr_Medium 
##                                                     0 
##                                           polintr_Low 
##                                                     0 
##                                    polintr_Not at all 
##                                                     0 
##                                     religion_Catholic 
##                                                     0 
##                           religion_Orthodox Christian 
##                                                     0 
##                                   religion_Protestant 
##                                                     1 
##                              religion_Other Christian 
##                                                     2 
##                                        religion_Other 
##                                                     0 
##                                       religion_Muslim 
##                                                     4 
##                                religion_Non-believers 
##                                                     0 
##             social_class_The working class of society 
##                                                     0 
##        social_class_The lower middle class of society 
##                                                     0 
##              social_class_The middle class of society 
##                                                     0 
##        social_class_The upper middle class of society 
##                                                     0 
##              social_class_The higher class of society 
##                                                     3

We are seeing NAs because not all factor levels are represented in every country. For example, social_class_The higher class of society has 3 missing values. Let's confirm that there are 3 countries in our dataframe with no observation at that level:

final_data |> filter(social_class=="The higher class of society") |>  pull(country) |> unique() |> length()
## [1] 25

There are indeed 3 countries (28 − 25) with no observation of social_class == "The higher class of society". Those percentages should therefore be 0s and not NAs.

data_factors <- data_factors |>
  mutate(across(where(is.numeric), ~replace_na(., 0)))  # restrict to numeric columns so `country` is left untouched

Joining all together

data_aggr <- data_aggr |> 
  left_join(data_factors, by = "country")

Lastly, joining the country level variables obtained from external sources

data_aggr <- data_aggr |> 
  left_join(country_level_data, by = c("country" = "iso2c")) |> 
  select(-c(iso3c, country.y))
dim(data_aggr)
## [1] 28 79

The resulting data frame has 28 observations and 79 variables.

Linear regression with Elastic Net

Model output compared below in predictive model performance section.

data_models <- data_aggr |> 
  select(-country)

set.seed(123)

# Define the leave-one-out cross-validation control
ctrl <- trainControl(
  method = "LOOCV",
)

# Defining Elastic net grid 
elastic_grid <- expand.grid(
  alpha = seq(0, 1, length = 10),  # Search from Ridge (0) to Lasso (1)
  lambda = 10^seq(-5, 1, length = 100)  # Regularization strength
)

# Train the Elastic Net model
elastic_net_model <- train(
  mean_trans_docs ~ ., 
  data = data_models, 
  method = "glmnet",
  preProcess = c("center", "scale"),
  trControl = ctrl,
  tuneGrid = elastic_grid,
  metric = "RMSE"
)

# Display the best alpha and lambda values
#print(elastic_net_model)
elastic_net_model$bestTune
##         alpha     lambda
## 756 0.7777778 0.02154435
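For reference, glmnet's elastic-net penalty mixes the Ridge and Lasso terms through alpha, so the selected alpha ≈ 0.78 means the fit is mostly Lasso-like. The penalty can be written out as a small function (a sketch of the formula, not glmnet's internals):

```r
# Elastic-net penalty as parameterized by glmnet:
#   lambda * ( (1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|) )
elastic_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

elastic_penalty(beta = c(0.5, -0.25), alpha = 1, lambda = 0.1)  # pure Lasso: 0.075
elastic_penalty(beta = c(0.5, -0.25), alpha = 0, lambda = 0.1)  # pure Ridge: 0.015625
```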

Random forest

Model output compared below in predictive model performance section.

set.seed(123)

# Define LOOCV control
control <- trainControl(
  method = "LOOCV",
  returnResamp = "all"
)

# Define hyperparameter grid for Random Forest
rf_grid <- expand.grid(
  mtry = seq(2, ncol(data_models) - 1, by = 2)
)
# mtry the number of variables randomly sampled at each split

# Train Random Forest model with hyperparameter tuning
rf_model <- train(
  mean_trans_docs ~ ., 
  data = data_models, 
  method = "rf",
  preProc=c('scale','center'),
  trControl = control,
  tuneGrid = rf_grid,
  importance = TRUE   # Calculate variable importance
)

# Results
#print(rf_model)
plot(rf_model)

rf_model$bestTune
##    mtry
## 32   64

Gradient boosting

Model output compared below in predictive model performance section.

set.seed(123)

control <- trainControl(
  method = "LOOCV",
  returnResamp = "all"
)

gbm_grid <- expand.grid(
  n.trees = c(50, 100, 200),           # Number of trees
  interaction.depth = c(1, 2, 3),      # Tree depth
  shrinkage = c(0.01, 0.05, 0.1),      # Shrinkage
  n.minobsinnode = c(1, 3, 5)          # Minimum observations
)

gbm_model <- train(
  mean_trans_docs ~ ., 
  data = data_models, 
  method = "gbm",
  preProc = c('scale', 'center'),
  trControl = control,
  tuneGrid = gbm_grid,
  verbose = FALSE)

#print(gbm_model)
gbm_model$bestTune
##    n.trees interaction.depth shrinkage n.minobsinnode
## 70     200                 2       0.1              1

Predictive model performance

Here we analyse the benchmark model and the 3 predictive models. We aggregate all the performance metrics in this section and then comment on which model performs best.

# Adding benchmark (mean)
benchFit <- lm(mean_trans_docs ~ 1, data=data_models)

benchmark_pred <- predict(benchFit, data_aggr)
elastic_pred <- predict(elastic_net_model, data_aggr)
rf_pred <- predict(rf_model, data_aggr)
gbm_pred <- predict(gbm_model, data_aggr)

# Actual values
actual <- data_aggr$mean_trans_docs

# Create a data frame with all predictions and actual values
all_predictions <- data.frame(
  country = data_aggr$country,
  actual = actual,
  benchmark = benchmark_pred,
  elastic_net = elastic_pred,
  random_forest = rf_pred,
  gradient_boosting = gbm_pred
)

# Calculate errors for each model
all_predictions$benchmark_error <- abs(all_predictions$actual - all_predictions$benchmark)
all_predictions$elastic_error <- abs(all_predictions$actual - all_predictions$elastic_net)
all_predictions$rf_error <- abs(all_predictions$actual - all_predictions$random_forest)
all_predictions$gbm_error <- abs(all_predictions$actual - all_predictions$gradient_boosting)

# Calculate performance metrics for each model
calculate_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  r_squared <- 1 - sum(residuals^2) / sum((actual - mean(actual))^2)
  rmse <- sqrt(mean(residuals^2))
  mae <- mean(abs(residuals))
  
  return(c(R_squared = r_squared, RMSE = rmse, MAE = mae))
}

# Get metrics for each model
benchmark_metrics <- calculate_metrics(actual, benchmark_pred)
elastic_metrics <- calculate_metrics(actual, elastic_pred)
rf_metrics <- calculate_metrics(actual, rf_pred)
gbm_metrics <- calculate_metrics(actual, gbm_pred)

# Combine metrics
model_metrics <- rbind(
  "Benchmark(=Mean)" = benchmark_metrics,
  Elastic_Net = elastic_metrics,
  Random_Forest = rf_metrics,
  Gradient_Boosting = gbm_metrics
)

# Print performance comparison
print(model_metrics)
##                   R_squared         RMSE          MAE
## Benchmark(=Mean)  0.0000000 0.2070649266 0.1741883440
## Elastic_Net       0.9469052 0.0477125549 0.0401216524
## Random_Forest     0.9583596 0.0422536351 0.0342401088
## Gradient_Boosting 0.9999953 0.0004477888 0.0003622869

We are able to predict so well partly because these observations share many characteristics. The small number of observations also makes LOOCV feasible, which yields impressive results. However, we should treat these results with care: the limited number of observations combined with such high performance metrics suggests that our models are prone to overfitting. Because the observations are few and similar, we predict very well within this set, but the generalizability of our models is low. We would need many more, and far more diverse, countries in the training data to build a model applicable outside Europe.
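For reference, the LOOCV resampling mentioned above can be requested in caret through a one-line control object. The sketch below is illustrative; the actual `train()` calls and tuning grids used for the models above may differ.

# LOOCV sketch (illustrative): with n = 28 countries, caret fits 28 models,
# each holding out one country as the validation case
loocv_ctrl <- trainControl(method = "LOOCV")

# Example call shape (assumed, not the exact specification used above):
# elastic_net_model <- train(mean_trans_docs ~ ., data = data_models,
#                            method = "glmnet", trControl = loocv_ctrl)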

# Sort predictions by actual values (descending)
sorted_predictions <- all_predictions[order(all_predictions$actual, decreasing = TRUE), ]

# Print sorted predictions
print(sorted_predictions[, c("country", "actual", "benchmark", "elastic_net", "random_forest", "gradient_boosting")])
##    country    actual benchmark elastic_net random_forest gradient_boosting
## 9       ES 0.9033659 0.5965319   0.8524193     0.8194864         0.9024291
## 22      NL 0.8663265 0.5965319   0.8423528     0.8237708         0.8657931
## 21      MT 0.8526786 0.5965319   0.8221836     0.8140206         0.8530928
## 7       DK 0.7967742 0.5965319   0.7403161     0.7698585         0.7962633
## 6       DE 0.7906977 0.5965319   0.7787671     0.7781446         0.7906727
## 11      FR 0.7901639 0.5965319   0.7352544     0.7753641         0.7898748
## 19      LU 0.7884615 0.5965319   0.7664504     0.7745369         0.7889321
## 24      PT 0.7749726 0.5965319   0.6632592     0.7392899         0.7751905
## 16      IE 0.7324263 0.5965319   0.7264290     0.7254427         0.7321317
## 2       BE 0.7278107 0.5965319   0.7260266     0.7025686         0.7275521
## 26      SE 0.7217295 0.5965319   0.7265787     0.7347874         0.7220738
## 10      FI 0.7100793 0.5965319   0.7267072     0.7336228         0.7104929
## 12      GB 0.7014428 0.5965319   0.7396484     0.7284936         0.7023535
## 1       AT 0.6154650 0.5965319   0.6371225     0.6590789         0.6162024
## 8       EE 0.6127024 0.5965319   0.5584945     0.5718270         0.6124239
## 13      GR 0.5980498 0.5965319   0.6064764     0.5712756         0.5981619
## 4       CY 0.5388471 0.5965319   0.5638382     0.5100675         0.5388495
## 27      SI 0.5282609 0.5965319   0.5777653     0.5571274         0.5283034
## 23      PL 0.5000000 0.5965319   0.4459348     0.4521700         0.4993352
## 20      LV 0.4840614 0.5965319   0.4266426     0.4457641         0.4839159
## 17      IT 0.4769404 0.5965319   0.5510112     0.4899712         0.4773019
## 14      HR 0.4539474 0.5965319   0.5159239     0.4878352         0.4542064
## 5       CZ 0.4472252 0.5965319   0.4214334     0.4796742         0.4474706
## 18      LT 0.4265570 0.5965319   0.4801427     0.4407894         0.4273383
## 28      SK 0.2990971 0.5965319   0.3362504     0.2972059         0.2990450
## 25      RO 0.2386740 0.5965319   0.2716489     0.2768576         0.2392832
## 3       BG 0.1666667 0.5965319   0.2572985     0.2779446         0.1666693
## 15      HU 0.1594684 0.5965319   0.2065162     0.2533548         0.1592382
# Visualize predictions across countries
# Reshape the data for plotting

plot_data <- sorted_predictions %>%
  select(country, actual, elastic_net, random_forest, gradient_boosting) %>%
  gather(key = "model", value = "prediction", -country, -actual)

# Create plot
ggplot(plot_data, aes(x = reorder(country, -actual), y = prediction, fill = model)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  geom_point(aes(y = actual), color = "black", size = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Model Predictions vs Actual Values by Country",
       x = "Country", y = "Proportion Supporting Trans Document Changes") +
  scale_fill_brewer(palette = "Set1")

# Create a scatterplot of actual vs predicted for each model
ggplot(plot_data, aes(x = actual, y = prediction, color = model)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  facet_wrap(~model) +
  theme_minimal() +
  labs(title = "Actual vs Predicted Values by Model",
       x = "Actual Values", y = "Predicted Values")

# Find which model performed best for each country
all_predictions$best_model <- apply(all_predictions[, c("elastic_error", "rf_error", "gbm_error")], 1, 
                                   function(x) c("Elastic Net", "Random Forest", "Gradient Boosting")[which.min(x)])

# Count how many countries each model performed best on
table(all_predictions$best_model)
## 
## Gradient Boosting 
##                28

Overall, the random forest and gradient boosting models perform best. The gradient boosting model, however, predicts better at the extremes, i.e. in countries with very high or very low support for the target variable, where class imbalance is strongest.

The random forest, on the other hand, does not predict these more extreme country values as well.
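As a rough illustration of the point about the extremes, the mean absolute errors for the highest- and lowest-support countries can be compared against the middle of the distribution, using the `sorted_predictions` frame built above (the five-country cutoff is an arbitrary choice):

# Compare MAE at the extremes vs the middle (illustrative)
n <- nrow(sorted_predictions)
extreme <- sorted_predictions[c(1:5, (n - 4):n), ]
middle  <- sorted_predictions[6:(n - 5), ]

sapply(list(extreme = extreme, middle = middle), function(d) {
  c(rf  = mean(abs(d$actual - d$random_forest)),
    gbm = mean(abs(d$actual - d$gradient_boosting)))
})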

Checks for overfitting in GBM

# Calculate residuals
gbm_residuals <- actual - gbm_pred

# Create a dataframe
residuals_df <- data.frame(actual, gbm_residuals)

# Plot residuals
ggplot(residuals_df, aes(x = actual, y = gbm_residuals)) +
  geom_point(color = "red", alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(title = "Residuals Plot for Gradient Boosting",
       x = "Actual Values", y = "Residuals")

Variables importance

Commentary on all three models is provided after the plots.

Elastic Net:

# Extracting all coefficients of the best Elastic Net Model
coefficients <- coef(elastic_net_model$finalModel, s=elastic_net_model$bestTune$lambda)

# Keeping only the non-zero coefficients
non0_coefficients <- data.frame(
  feature = rownames(coefficients),
  coefficient = as.numeric(coefficients))  |>  
  filter(coefficient != 0) |> 
  arrange(desc(abs(coefficient)))
non0_coefficients
##                                     feature   coefficient
## 1                               (Intercept)  5.965319e-01
## 2                          mean_gender_docs  1.085275e-01
## 3                        mean_roma_minority -3.258333e-02
## 4                     mean_antilgbtq_rights -2.203364e-02
## 5     `friends_trans_Refusal (SPONTANEOUS)` -1.689666e-02
## 6                       mean_nonEU_national  1.025104e-02
## 7               `phone_access_No telephone` -9.770012e-03
## 8           `life_sat_Not at all satisfied` -9.335941e-03
## 9                           democracy_index  3.171835e-03
## 10                     mean_minority_discri -3.107835e-03
## 11 `internet_use_Two or three times a week` -4.820864e-06

Random Forest:

# Plotting the top 10 most important variables 
varImp(rf_model, scale = F)$importance %>%
  as.data.frame() %>%
  tibble::rownames_to_column("variable") %>%
  arrange(desc(Overall)) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(variable, Overall), y = Overall)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(title = "Feature Importance - Random Forest",
       x = "Features", y = "Importance") +
    theme_minimal()

Gradient Boosting:

# Plotting the top 10 most important variables 
varImp(gbm_model, scale = F)$importance %>%
  as.data.frame() %>%
  tibble::rownames_to_column("variable") %>%
  arrange(desc(Overall)) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(variable, Overall), y = Overall)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(title = "Feature Importance - Gradient Boosting",
       x = "Features", y = "Importance") +
    theme_minimal()

The most important variables in the predictive models are consistent with what we have seen throughout the modelling process, from the simple logistic regression to explanatory mixed model, to predictive model.

Conclusions

After extensive variable selection, tuning and model testing, we found that the best explanatory model was a cross-level mixed model that correctly classified around 84% of observations. We note, however, that this model is only slightly better than much more computationally efficient alternatives: the simple logistic regression classified around 82% of observations correctly, and the basic mixed model, with only a random intercept and individual-level predictors, classified just under 84% correctly. Given this near-equal performance at much lower computational cost, the basic mixed model is our preferred explanatory model.

Across the models, in summary, some of the factors related to supporting the policy included: having trans friends, being a non-believer, and being an everyday internet user.

Some of the factors related to opposing the policy included: expressing anti-LGBTQ+ views, expressing more discriminatory views against minorities, having lower life satisfaction, and holding a more right-wing ideology.

We believe most of these factors are plausibly connected to support for a more progressive inclusion policy.

The models were less accurate for countries with greater class imbalance, i.e. those with more extreme positive or negative views.